ABSTRACT

Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.


1. INTRODUCTION 

The process of populating a structured relational database from unstructured sources has received renewed interest in the database community through high-profile start-up companies (e.g., Tamr and Trifacta), established companies like IBM’s Watson, and a variety of research efforts [30, 33]. At the same time, communities such as natural language processing and machine learning are attacking similar problems under the name knowledge base construction (KBC) [17]. While different communities place differing emphasis on the extraction, cleaning, and integration phases, all communities seem to be converging toward a common set of techniques that include a mix of data processing, machine learning, and engineers-in-the-loop.

The ultimate goal of KBC is to obtain high-quality structured data from unstructured information. These databases are richly structured with tens of different entity types in complex relationships. Typically, quality is assessed using two complementary measures: precision (how often a claimed tuple is correct) and recall (of the possible tuples to extract, how many are actually extracted). These systems can ingest massive numbers of documents, far outstripping the document counts of even well-funded human curation efforts. Industrially, KBC systems are constructed by skilled engineers in a months-long (or longer) process, not a one-shot algorithmic task. Arguably, the most important question in such systems is how to best use skilled engineers’ time to rapidly improve data quality. In its full generality, this question spans a number of areas in computer science, including programming languages, systems, and HCI. We focus on a narrower question, with the axiom that the more rapidly the programmer moves through the KBC construction loop, the more quickly she obtains high-quality data.

This paper presents DeepDive, our open-source engine for knowledge base construction. DeepDive’s language and execution model are similar to other KBC systems: DeepDive uses a high-level declarative language [33]. From a database perspective, DeepDive’s language is based on SQL. From a machine learning perspective, DeepDive’s language is based on Markov Logic [16, 35]: DeepDive’s language inherits Markov Logic Networks’ (MLN’s) formal semantics. Moreover, it uses a standard execution model for such systems [33] in which programs go through two main phases. In the grounding phase, one evaluates a sequence of SQL queries to produce a data structure called a factor graph that describes a set of random variables and how they are correlated. Essentially, every tuple in the database or result of a query is a random variable (node) in this factor graph. The inference phase takes the factor graph from grounding and performs statistical inference using standard techniques, e.g., Gibbs sampling. The output of inference is the marginal probability of every tuple in the database. As with Google’s Knowledge Vault and others, DeepDive also produces marginal probabilities that are calibrated: if one examined all facts with probability 0.9, we would expect that approximately 90% of these facts would be correct. To calibrate these probabilities, DeepDive estimates (i.e., learns) parameters of the statistical model from data. Inference is a subroutine of the learning procedure and is the critical loop. Inference and learning are computationally intense (hours on 1TB RAM/48-core machines).
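The notion of calibration above can be checked empirically. The following is a minimal sketch, assuming a hand-labeled sample of (probability, is_correct) pairs is available; the function name, the bucket count, and the sample data are hypothetical and not part of DeepDive.

```python
from collections import defaultdict


def calibration_table(predictions, num_buckets=10):
    """Map each non-empty probability bucket (by index) to its empirical accuracy.

    predictions is an iterable of (probability, is_correct) pairs.
    Bucket i covers probabilities in [i / num_buckets, (i + 1) / num_buckets).
    """
    buckets = defaultdict(list)
    for prob, is_correct in predictions:
        # Clamp so that a probability of exactly 1.0 lands in the last bucket.
        index = min(int(prob * num_buckets), num_buckets - 1)
        buckets[index].append(bool(is_correct))
    return {i: sum(labels) / len(labels) for i, labels in sorted(buckets.items())}


# Hypothetical sample: ten facts predicted at 0.9, nine of which are correct.
sample = [(0.9, True)] * 9 + [(0.9, False)]
table = calibration_table(sample)
```

For calibrated marginals, the accuracy reported for each bucket should be close to the probabilities that fall into it.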


^ http://deepdive.stanford.edu

^ DeepDive has some technical differences from Markov Logic that we have found useful in building applications. We discuss these differences in Section 2.3.
In our experience with DeepDive, we found that KBC is an iterative process. In the past few years, DeepDive has been used to build dozens of high-quality KBC systems by a handful of technology companies, a number of law enforcement agencies via DARPA’s MEMEX, and scientists in fields such as paleobiology, drug repurposing, and genomics. Recently, we compared a DeepDive system’s extractions to the quality of extractions provided by human volunteers over the last ten years for a paleobiology database, and we found that the DeepDive system had higher quality (both precision and recall) on many entities and relationships. Moreover, on all of the extracted entities and relationships, DeepDive had no worse quality [37]. Additionally, the winning entry of the 2014 TAC-KBP competition was built on DeepDive. In all cases, we have seen that the process of developing KBC systems is iterative: quality requirements change, new data sources arrive, and new concepts are needed in the application. This led us to develop techniques to make the entire pipeline incremental in the face of changes both to the data and to the DeepDive program. Our primary technical contributions are to make the grounding and inference phases more incremental.

Incremental Grounding. Grounding and feature extraction are performed by a series of SQL queries. To make this phase incremental, we adapt the algorithm of Gupta, Mumick, and Subrahmanian [21]. In particular, DeepDive allows one to specify “delta rules” that describe how the output will change as a result of changes to the input. Although straightforward, this optimization has not been applied systematically in such systems and can yield up to 360× speedup in KBC systems.
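As a rough illustration of the delta-rule idea, the following Python sketch maintains a join view V(x, z) :- R(x, y), S(y, z) when tuples are inserted. The relation names, the helper functions, and the restriction to insertions under set semantics are assumptions made for this sketch; it is not DeepDive’s implementation and omits the deletion handling of the algorithm of Gupta, Mumick, and Subrahmanian.

```python
def join(r, s):
    """Join R(x, y) with S(y, z) on y and project onto (x, z)."""
    z_by_y = {}
    for y, z in s:
        z_by_y.setdefault(y, set()).add(z)
    return {(x, z) for x, y in r for z in z_by_y.get(y, ())}


def delta_join(r, s, delta_r, delta_s):
    """View tuples that become derivable after inserting delta_r and delta_s.

    Delta rule for insertions: dV = (dR join S) + (R join dS) + (dR join dS).
    """
    return join(delta_r, s) | join(r, delta_s) | join(delta_r, delta_s)


# Toy data (hypothetical).
r = {("a", 1), ("b", 2)}
s = {(1, "u")}
view = join(r, s)

delta_r, delta_s = {("c", 1)}, {(2, "v")}
view_incremental = view | delta_join(r, s, delta_r, delta_s)
view_from_scratch = join(r | delta_r, s | delta_s)
```

The incremental result matches the view recomputed from scratch, but only the joins that involve a delta relation are evaluated.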

Incremental Inference. Due to our choice of incremental grounding, the input to DeepDive’s inference phase is a factor graph along with a set of changed data and rules. The goal is to compute the updated output probabilities. Our approach is to frame the incremental maintenance problem as one of approximate inference. Previous work in the database community has looked at how machine learning data products change in response both to new labels and to new data. In KBC, both the program and data change on each iteration. Our proposed approach can cope with both types of change simultaneously.

The technical question is which approximate inference algorithms to use in KBC applications. We choose to study two popular classes of approximate inference techniques: sampling-based materialization (inspired by sampling-based probabilistic databases such as MCDB) and variational-based materialization (inspired by techniques for approximating graphical models). Applying these techniques to incremental maintenance for KBC is novel, and it is not theoretically clear how the techniques compare. Thus, we conducted an experimental evaluation of these two approaches on a diverse set of DeepDive programs.

We found these two approaches are sensitive to changes along three largely orthogonal axes: the size of the factor graph, the sparsity of correlations, and the anticipated number of future changes. The performance varies by up to two orders of magnitude in different points of the space. Our study of the tradeoff space highlights that neither materialization strategy dominates the other. To automatically choose the materialization strategy, we develop a simple rule-based optimizer.

^ As incremental learning uses standard techniques, we cover it only in the full version of this paper.

Experimental Evaluation Highlights. We used DeepDive programs developed by our group and DeepDive users to understand whether the improvements we describe can speed up the iterative development process of DeepDive programs. To understand the extent to which DeepDive’s techniques improve development time, we took a sequence of six snapshots of a KBC system and ran them with our incremental techniques and completely from scratch. In these snapshots, our incremental techniques are 22× faster. The results for each snapshot differ at most by 1% for high-quality facts (90%+ accuracy); fewer than 4% of facts differ by more than 0.05 in probability between approaches. Thus, essentially the same facts were given to the developer throughout execution using the two techniques, but the incremental techniques delivered them more quickly.


Outline. The rest of the paper is organized as follows. Section 2 contains an in-depth analysis of the KBC development process and the presentation of our language for modeling KBC systems. We discuss the different techniques for incremental maintenance in Section 3. We also present the results of the exploration of the tradeoff space and the description of our optimizer. Our experimental evaluation is presented in Section 4.

Related Work 

Knowledge Base Construction (KBC). KBC has been an area of intense study over the last decade, moving from pattern matching and rule-based systems [30] to systems that use machine learning for KBC [18, 33]. Many groups have studied how to improve the quality of specific components of KBC systems. We build on this line of work. We formalized the development process and built DeepDive to ease and accelerate the KBC process, which we hope is of interest to many of these systems as well. DeepDive has many features in common with Chen and Wang [12], Google’s Knowledge Vault, and a forerunner of DeepDive, Tuffy. We focus on the incremental evaluation from feature extraction to inference.


Declarative Information Extraction. The database community has proposed declarative languages for information extraction, a task with similar goals to knowledge base construction, by extending relational operations [30] or rule-based approaches. These approaches can take advantage of classic view maintenance techniques to make the execution incremental, but they do not study how to incrementally maintain the result of statistical inference and learning, which is the focus of our work.


Incremental Maintenance of Statistical Inference and Learning. Related work has focused on incremental inference for specific classes of graphs (tree-structured or low-degree graphical models). We deal instead with the class of factor graphs that arise from the KBC process, which is much more general than the ones examined in previous approaches. Nath and Domingos studied how to extend belief propagation on factor graphs with new evidence, but without any modification to the structure of the graph. Wick and McCallum proposed a “query-aware MCMC” method. They designed a proposal scheme so that query variables tend to be sampled more frequently than other variables. We frame our problem as approximate inference, which allows us to handle changes to the program and the data in a single approach.

Figure 1: A KBC system takes as input unstructured documents and outputs a structured knowledge base. The runtimes are for the TAC-KBP competition system (News). To improve quality, the developer adds new rules and new data.


2. KBC USING DEEPDIVE 

We describe DeepDive, an end-to-end framework for building KBC systems with a declarative language. We first recall standard definitions, and then introduce the essentials of the framework by example, compare our framework with Markov Logic, and describe DeepDive’s formal semantics.

2.1 Definitions for KBC Systems 

The input to a KBC system is a heterogeneous collection of unstructured, semi-structured, and structured data, ranging from text documents to existing but incomplete KBs. The output of the system is a relational database containing facts extracted from the input and put into the appropriate schema. Creating the knowledge base may involve extraction, cleaning, and integration.


Example 2.1. Figure 2 illustrates our running example: a knowledge base with pairs of individuals that are married to each other. The input to the system is a collection of news articles and an incomplete set of married persons; the output is a KB containing pairs of persons that are married. A KBC system extracts linguistic patterns, e.g., “... and his wife ...” between a pair of mentions of individuals (e.g., “Barack Obama” and “M. Obama”). Roughly, these patterns are then used as features in a classifier deciding whether this pair of mentions indicates that they are married (in the HasSpouse relation).

We adopt standard terminology from KBC, e.g., ACE. There are four types of objects that a KBC system seeks to extract from input documents, namely entities, relations, mentions, and relation mentions. An entity is a real-world person, place, or thing. For example, “Michelle_Obama_1” represents the actual entity for a person whose name is “Michelle Obama”; another individual with the same name would have another number. A relation associates two (or more) entities, and represents the fact that there exists a relationship between the participating entities. For example, “Barack_Obama_1” and “Michelle_Obama_1” participate in the HasSpouse relation, which indicates that they are married. These real-world entities and relationships are described in text; a mention is a span of text in an input document that refers to an entity or relationship: “Michelle” may be a mention of the entity “Michelle_Obama_1.” A relation mention is a phrase that connects two mentions that participate in a relation, such as “(Barack Obama, M. Obama)”. The process of mapping mentions to entities is called entity linking.

^ http://www.itl.nist.gov/iad/mig/tests/ace/2000/


2.2 The DeepDive Framework

DeepDive is an end-to-end framework for building KBC systems, as shown in Figure 1. We walk through each phase. DeepDive supports both SQL and datalog, but we use datalog syntax for exposition. The rules we describe in this section are manually created by the user of DeepDive, and the process of creating these rules is application-specific.

Candidate Generation and Feature Extraction. All data in DeepDive is stored in a relational database. The first phase populates the database using a set of SQL queries and user-defined functions (UDFs) that we call feature extractors. By default, DeepDive stores all documents in the database in one sentence per row with markup produced by standard NLP pre-processing tools, including HTML stripping, part-of-speech tagging, and linguistic parsing. After this loading step, DeepDive executes two types of queries: (1) candidate mappings, which are SQL queries that produce possible mentions, entities, and relations, and (2) feature extractors that associate features to candidates, e.g., “... and his wife ...” in Example 2.1.

Example 2.2. Candidate mappings are usually simple. Here, we create a relation mention for every pair of candidate persons in the same sentence (s):

(R1) MarriedCandidate(m1, m2) :-
         PersonCandidate(s, m1), PersonCandidate(s, m2).
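To make the candidate mapping rule R1 concrete, the following Python sketch evaluates it over an in-memory list of PersonCandidate tuples. This is only an illustration under stated assumptions: DeepDive evaluates such rules as SQL queries, the function name and data layout are hypothetical, and skipping pairs of a mention with itself is an extra filter that the rule as written does not include.

```python
from collections import defaultdict


def married_candidates(person_candidates):
    """Pair up all person mentions that occur in the same sentence (rule R1).

    person_candidates is an iterable of (sentence_id, mention_id) tuples.
    """
    mentions_by_sentence = defaultdict(list)
    for sentence_id, mention_id in person_candidates:
        mentions_by_sentence[sentence_id].append(mention_id)

    pairs = set()
    for mentions in mentions_by_sentence.values():
        for m1 in mentions:
            for m2 in mentions:
                if m1 != m2:  # assumption: do not pair a mention with itself
                    pairs.add((m1, m2))
    return pairs


# Hypothetical input: two person mentions in sentence s1, one in sentence s2.
candidates = married_candidates([("s1", "m1"), ("s1", "m2"), ("s2", "m3")])
```

The result is deliberately over-inclusive: every co-occurring pair becomes a candidate, and later phases decide which candidates are actual marriages.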


Candidate mappings are simply SQL queries with UDFs that look like low-precision but high-recall ETL scripts. Such rules must be high recall: if the union of candidate mappings misses a fact, DeepDive has no chance to extract it.

We also need to extract features, and we extend classical Markov Logic in two ways: (1) user-defined functions and (2) weight tying, which we illustrate by example.

^ For more information, including examples, please see http://deepdive.stanford.edu. Note that our engine is built on Postgres and Greenplum for all SQL processing and UDFs. There is also a port to MySQL.

[Figure 2 content: rules R1 and FE1, and panel (3b) Supervision Rules with rule S1.]

Figure 2: An example KBC system. See Section 2.2 for details.

Example 2.3. Suppose that phrase(m1, m2, sent) returns the phrase between two mentions in the sentence, e.g., “and his wife” in the above example. The phrase between two mentions may indicate whether two people are married. We would write this as:

(FE1) MarriedMentions(m1, m2) :-
          MarriedCandidate(m1, m2), Mention(s, m1),
          Mention(s, m2), Sentence(s, sent)
          weight = phrase(m1, m2, sent).

One can think about this like a classifier: this rule says that whether the text indicates that the mentions m1 and m2 are married is influenced by the phrase between those mention pairs. The system will infer, based on training data, its confidence (by estimating the weight) that two mentions are indeed indicated to be married.

Technically, phrase returns an identifier that determines which weights should be used for a given relation mention in a sentence. If phrase returns the same result for two relation mentions, they receive the same weight. We explain weight tying in more detail in Section 2.3. In general, phrase could be an arbitrary UDF that operates in a per-tuple fashion. This allows DeepDive to support common examples of features, from “bag-of-words” to context-aware NLP features to highly domain-specific dictionaries and ontologies. In addition to specifying sets of classifiers, DeepDive inherits Markov Logic’s ability to specify rich correlations between entities via weighted rules. Such rules are particularly helpful for data cleaning and data integration.
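The role of phrase as a weight identifier can be sketched in Python as follows. The representation of a mention as a (start, end) token span, the whitespace tokenization, and the weights dictionary are assumptions made for illustration only; an actual feature extractor would work on DeepDive’s NLP markup.

```python
def phrase(m1, m2, sent):
    """Hypothetical UDF: return the text between two mentions of a sentence.

    Each mention is a (start, end) token span with an exclusive end offset.
    """
    tokens = sent.split()
    first, second = sorted([m1, m2])
    return " ".join(tokens[first[1]:second[0]])


# One learnable weight per distinct identifier returned by phrase (weight tying).
weights = {}


def weight_id(m1, m2, sent):
    """Register the identifier returned by phrase and return it."""
    key = phrase(m1, m2, sent)
    weights.setdefault(key, 0.0)
    return key


# Two relation mentions from different (hypothetical) sentences.
id_a = weight_id((0, 2), (5, 7), "Barack Obama and his wife M. Obama attended")
id_b = weight_id((0, 2), (5, 7), "John Smith and his wife Jane Smith arrived")
```

Both relation mentions map to the identifier "and his wife", so they share a single weight even though they come from different sentences.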

Supervision. Just as in Markov Logic, DeepDive can use training data or evidence about any relation; in particular, each user relation is associated with an evidence relation with the same schema and an additional field that indicates whether the entry is true or false. Continuing our example, the evidence relation MarriedMentions_Ev could contain mention pairs with positive and negative labels. Operationally, two standard techniques generate training data: (1) hand-labeling, and (2) distant supervision, which we illustrate below.

Example 2.4. Distant supervision [32] is a popular technique to create evidence in KBC systems. The idea is to use an incomplete KB of married entity pairs to heuristically label (as True evidence) all relation mentions that link to a pair of married entities:

(S1) MarriedMentions_Ev(m1, m2, true) :-
         MarriedCandidate(m1, m2), EL(m1, e1),
         EL(m2, e2), Married(e1, e2).

Here, Married is an (incomplete) list of married real-world persons that we wish to extend. The relation EL is for “entity linking” that maps mentions to their candidate entities. At first blush, this rule seems incorrect. However, it generates noisy, imperfect examples of sentences that indicate two people are married. Machine learning techniques are able to exploit redundancy to cope with the noise and learn the relevant phrases (e.g., “and his wife”). Negative examples are generated by relations that are largely disjoint (e.g., siblings). Similar to DIPRE and Hearst patterns, distant supervision exploits the “duality” between patterns and relation instances; furthermore, it allows us to integrate this idea into DeepDive’s unified probabilistic framework.
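The supervision rule S1 can be sketched in Python as a join between candidates, entity links, and the incomplete KB. The dictionary-based layout of EL and the sample identifiers are assumptions made for this sketch; DeepDive itself evaluates the rule in the database.

```python
def distant_supervision(married_candidates, entity_links, married_kb):
    """Label a candidate pair as true evidence if it links to a married pair.

    married_candidates: iterable of (m1, m2) mention pairs.
    entity_links: dict from a mention to its set of candidate entities (EL).
    married_kb: set of (e1, e2) entity pairs known to be married (Married).
    """
    evidence = set()
    for m1, m2 in married_candidates:
        for e1 in entity_links.get(m1, ()):
            for e2 in entity_links.get(m2, ()):
                if (e1, e2) in married_kb:
                    evidence.add((m1, m2, True))
    return evidence


# Hypothetical data following the running example.
links = {
    "m1": {"Barack_Obama_1"},
    "m2": {"Michelle_Obama_1"},
    "m3": {"Some_Other_Person_1"},
}
kb = {("Barack_Obama_1", "Michelle_Obama_1")}
labels = distant_supervision([("m1", "m2"), ("m1", "m3")], links, kb)
```

Only the pair whose linked entities appear in the KB receives a label; the labels are noisy because the sentence containing the mentions need not actually express the marriage.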


Learning and Inference. In the learning and inference phase, DeepDive generates a factor graph, similar to Markov Logic, and uses techniques from Tuffy. The inference and learning are done using standard techniques (Gibbs sampling) that we describe below after introducing the formal semantics.

Error Analysis. DeepDive runs the above three phases in sequence, and at the end of the learning and inference, it obtains a marginal probability p for each candidate fact. To produce the final KB, the user often selects facts in which we are highly confident, e.g., p > 0.95. Typically, the user needs to inspect errors and repeat, a process that we call error analysis. Error analysis is the process of understanding the most common mistakes (incorrect extractions, too-specific features, candidate mistakes, etc.) and deciding how to correct them [39]. To facilitate error analysis, users write standard SQL queries.

2.3 Discussion of Design Choices 

We have found three related aspects of the DeepDive approach that we believe enable non-computer scientists to write DeepDive programs: (1) there is no reference in a DeepDive program to the underlying machine learning algorithms. Thus, DeepDive programs are declarative in a strong sense. Probabilistic semantics provide a way to debug the system independently of any algorithm. (2) DeepDive allows users to write feature extraction code in familiar languages (Python, SQL, and Scala). (3) DeepDive fits into the familiar SQL stack, which allows standard tools to inspect and visualize the data. A second key property is that the user constructs an end-to-end system and then refines the quality of the system in a pay-as-you-go way [31]. In contrast, traditional pipeline-based ETL scripts may lead to time and effort spent on extraction and integration without the ability to evaluate how important each step is for end-to-end application quality. Anecdotally, pay-as-you-go leads to more informed decisions about how to improve quality.

Comparison with Markov Logic. Our language is based on Markov Logic [16, 35], and our current language inherits Markov Logic’s formal semantics. However, there are three differences in how we implement DeepDive’s language:

Weight Tying. As shown in rule FE1, DeepDive allows factors to share weights across rules, which is used in every DeepDive system. As we will see, declaring a classifier is a one-liner in DeepDive: Class(x) :- R(x, f) with weight = w(f) declares a classifier for objects (bindings of x); R(x, f) indicates that object x has features f. In standard MLNs, this would require one rule for each feature. In MLNs, every rule introduces a single weight, and the correlation structure and weight structure are coupled. DeepDive decouples them, which makes writing some applications easier.

User-defined Functions. As shown in rule FE1, DeepDive allows the user to use user-defined functions (phrase in FE1) to specify feature extraction rules. This allows DeepDive to handle common feature extraction idioms using regular expressions, Python scripts, etc. This brings more of the KBC pipeline into DeepDive, which allows DeepDive to find optimization opportunities for a larger fraction of this pipeline.


Implication Semantics. In the next section, we introduce a function $g$ that counts the number of groundings in different ways. $g$ is an example of transformation groups [Ch. 12], a technique from the Bayesian inference literature to model different noise distributions. Experimentally, we show that different semantics (choices of $g$) affect the quality of KBC applications (up to 10% in F1 score) compared with the default semantics of MLNs. After some notation, we give an example to illustrate how $g$ alters the semantics.

^ Our system Tuffy introduced this feature (weight tying) to MLNs, but its semantics had not been described in the literature.

[Figure 3 panels: Grounding; User Relations; Inference Rules; Factor Graph (Variables V, Factors F). Factor function corresponds to Equation 1 in Section 2.4.]

Figure 3: Schematic illustration of grounding. Each tuple corresponds to a Boolean random variable and node in the factor graph. We create one factor for every set of groundings.

Semantics    g(n)
Linear       $n$
Ratio        $\log(1 + n)$
Logical      $\mathbb{1}_{\{n > 0\}}$

Figure 4: Semantics for $g$ in Equation 1.

2.4 Semantics of a DeepDive Program

A DeepDive program is a set of rules with weights. During inference, the values of all weights $w$ are assumed to be known, while, in learning, one finds the set of weights that maximizes the probability of the evidence. As shown in Figure 3, a DeepDive program defines a standard structure called a factor graph. First, we directly define the probability distribution for rules that involve weights, as it may help clarify our motivation. Then, we describe the corresponding factor graph on which inference takes place.

Each possible tuple in the user schema (both IDB and EDB predicates) defines a Boolean random variable (r.v.). Let $V$ be the set of these r.v.’s. Some of the r.v.’s are fixed to a specific value, e.g., as specified in a supervision rule or by training data. Thus, $V$ has two parts: a set $\mathcal{E}$ of evidence variables (those fixed to specific values) and a set $\mathcal{Q}$ of query variables whose values the system will infer. The class of evidence variables is further split into positive evidence and negative evidence. We denote the set of positive evidence variables as $\mathcal{P}$, and the set of negative evidence variables as $\mathcal{N}$. An assignment to each of the query variables yields a possible world $I$ that must contain all positive evidence variables, i.e., $I \supseteq \mathcal{P}$, and must not contain any negatives, i.e., $I \cap \mathcal{N} = \emptyset$.


Boolean Rules. We first present the semantics of Boolean inference rules. For ease of exposition only, we assume that there is a single domain $D$. A rule $\gamma$ is a pair $(q, w)$ such that $q$ is a Boolean query and $w$ is a real number. An example is as follows:

q() :- R(x, y), S(y) weight = w.

We denote the body predicates of $q$ as $\mathrm{body}(z)$, where $z$ are all variables in the body of q(), e.g., $z = (x, y)$ in the example above. Given a rule $\gamma = (q, w)$ and a possible world $I$, we define the sign of $\gamma$ on $I$ as $\mathrm{sign}(\gamma, I) = 1$ if $q() \in I$ and $-1$ otherwise.

Given $c \in D^{|z|}$, a grounding of $q$ w.r.t. $c$ is a substitution $\mathrm{body}(z/c)$, where the variables in $z$ are replaced with the values in $c$. For example, for $q$ above with $c = (a, b)$, $\mathrm{body}(z/(a, b))$ yields the grounding R(a, b), S(b), which is a conjunction of facts. The support $n(\gamma, I)$ of a rule $\gamma$ in a possible world $I$ is the number of groundings $c$ for which $\mathrm{body}(z/c)$ is satisfied in $I$:

$$ n(\gamma, I) = |\{ c \in D^{|z|} : I \models \mathrm{body}(z/c) \}| $$

The weight of $\gamma$ in $I$ is the product of three terms:

$$ w(\gamma, I) = w \cdot \mathrm{sign}(\gamma, I) \cdot g(n(\gamma, I)), \qquad (1) $$
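The support and weight of the example rule q() :- R(x, y), S(y) can be computed directly for a small possible world. The following Python sketch is an illustration only: representing a world as a set of fact tuples, and the names G, support, and rule_weight, are assumptions of this sketch. The three choices of g are linear, ratio, and logical.

```python
import math

# Three semantics for g: linear, ratio, and logical.
G = {
    "linear": lambda n: n,
    "ratio": lambda n: math.log(1 + n),
    "logical": lambda n: 1 if n > 0 else 0,
}


def support(world, domain):
    """Count groundings (a, b) of the body R(x, y), S(y) satisfied in world."""
    return sum(
        1
        for a in domain
        for b in domain
        if ("R", a, b) in world and ("S", b) in world
    )


def rule_weight(w, world, domain, semantics="linear"):
    """Weight of the rule in a world: w * sign * g(support)."""
    sign = 1 if ("q",) in world else -1
    return w * sign * G[semantics](support(world, domain))


# Hypothetical possible world over the domain {a, b, c}.
domain = {"a", "b", "c"}
world = {("q",), ("R", "a", "b"), ("R", "c", "b"), ("S", "b")}
```

Here the groundings (a, b) and (c, b) satisfy the body, so the support is 2; removing q() from the world flips the sign of the rule’s weight.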
where $g$ is a real-valued function defined on the natural numbers. For intuition, if $w(\gamma, I) > 0$, it adds a weight that indicates that the world is more likely. If $w(\gamma, I) < 0$, it indicates that the world is less likely. As motivated above, we introduce $g$ to support multiple semantics. Figure 4 shows choices for $g$ that are supported by DeepDive, which we compare in an example below.

Let $\Gamma$ be a set of Boolean rules. The weight of $\Gamma$ on a possible world $I$ is defined as

$$ W(\Gamma, I) = \sum_{\gamma \in \Gamma} w(\gamma, I). $$

This function allows us to define a probability distribution over the set $\mathcal{J}$ of possible worlds:

$$ \Pr[I] = Z^{-1} \exp(W(\Gamma, I)) \quad \text{where} \quad Z = \sum_{I \in \mathcal{J}} \exp(W(\Gamma, I)), \qquad (2) $$

and $Z$ is called the partition function. This framework is able to compactly specify much more sophisticated distributions than traditional probabilistic databases [42].
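For a handful of query variables, the distribution above can be computed by brute-force enumeration of all possible worlds. The Python sketch below does exactly that; it is an illustration under stated assumptions (a world is the set of query variables that are true, and total_weight is a hypothetical stand-in for the summed rule weights), and it is exponential in the number of variables, which is why real systems rely on approximate inference.

```python
import itertools
import math


def world_probabilities(query_vars, total_weight):
    """Return Pr[I] for every possible world I over the given query variables.

    total_weight maps a world (a frozenset of the variables that are true)
    to its total weight W.
    """
    worlds = [
        frozenset(v for v, bit in zip(query_vars, bits) if bit)
        for bits in itertools.product([False, True], repeat=len(query_vars))
    ]
    unnormalized = {world: math.exp(total_weight(world)) for world in worlds}
    z = sum(unnormalized.values())  # partition function Z
    return {world: value / z for world, value in unnormalized.items()}


def total_weight(world):
    """Hypothetical single rule of weight 1 and support 1 whose head is q1."""
    return 1.0 if "q1" in world else -1.0


probs = world_probabilities(["q1", "q2"], total_weight)
marginal_q1 = sum(p for world, p in probs.items() if "q1" in world)
```

In this toy case the marginal of q1 is e / (e + 1/e), i.e., a logistic function of twice the rule weight, because the sign term contributes +1 or -1.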

Example 2.5. We illustrate the semantics by example. From the Web, we could extract a set of relation mentions that supports “Barack Obama is born in Hawaii” and another set of relation mentions that support “Barack Obama is born in Kenya.” These relation mentions provide conflicting information, and one common approach is to “vote.” We abstract this as up or down votes about a fact q().

q() :- Up(x) weight = 1.
q() :- Down(x) weight = -1.

We can think of this as having a single random variable q(), in which Up (resp. Down) is an evidence relation whose size indicates the number of “Up” (resp. “Down”) votes. There are only two possible worlds: one in which $q() \in I$ (q is true) and one in which it is not. Let $|Up|$ and $|Down|$ be the sizes of Up and Down. Following Equations 1 and 2, we have

$$ \Pr[q()] = \frac{e^{W}}{e^{-W} + e^{W}} \quad \text{where} \quad W = g(|Up|) - g(|Down|). $$

Consider the case when $|Up| = 10^6$ and $|Down| = 10^6 - 100$. In some scenarios, this small number of differing votes could be due to random noise in the data collection processes. One would expect a probability for q() close to 0.5. In the linear semantics $g(n) = n$, the probability of q is $(1 + e^{-200})^{-1} \approx 1 - e^{-200}$, which is extremely close to 1. In contrast, if we set $g(n) = \log(1 + n)$, then $\Pr[q()] \approx 0.5$. Intuitively, the probability depends on the ratio of these votes. The logical semantics $g(n) = \mathbb{1}_{\{n > 0\}}$ gives exactly $\Pr[q()] = 0.5$. However, it would do the same if $|Down| = 1$. Thus, logical semantics may ignore the strength of the voting information. At a high level, ratio semantics can learn weights from examples with different raw counts but similar ratios. In contrast, linear is appropriate when the raw counts themselves are meaningful.
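The voting example can be checked numerically. The sketch below assumes the closed form Pr[q()] = e^W / (e^-W + e^W) with W = g(|Up|) - g(|Down|), which follows from the two rules and the definition of the rule weight, and it uses one million Up votes and one hundred fewer Down votes as our reading of the counts in the example; the function name is hypothetical.

```python
import math


def prob_q(up, down, g):
    """Probability of q() given vote counts and a semantics function g.

    Uses e^W / (e^-W + e^W) = 1 / (1 + e^(-2W)) with W = g(up) - g(down).
    """
    w = g(up) - g(down)
    return 1.0 / (1.0 + math.exp(-2.0 * w))


up, down = 10**6, 10**6 - 100

linear = prob_q(up, down, lambda n: n)
ratio = prob_q(up, down, lambda n: math.log(1 + n))
logical = prob_q(up, down, lambda n: 1 if n > 0 else 0)
```

Under linear semantics the hundred extra votes push the probability to essentially 1, under ratio semantics it stays within a small fraction of a percent of 0.5, and under logical semantics it is exactly 0.5.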

No semantics subsumes the others, and each is appropriate in some application. We have found that in many cases the ratio semantics is more suitable for the application that the user wants to model. We show in the full version that these semantics also affect efficiency empirically and theoretically, even for the above simple example. Intuitively, sampling converges faster in the logical or ratio semantics because the distribution is less sharply peaked, which means that the sampler is less likely to get stuck in local minima.

Extension to General Rules. Consider a general inference rule $\gamma = (q, w)$, written as:

q(y) :- body(z) weight = w(x).

where $x \subseteq z$ and $y \subseteq z$. This extension allows weight tying. Given $b \in D^{|x \cup y|}$, where $b_x$ (resp. $b_y$) are the values of $b$ in $x$ (resp. $y$), we expand $\gamma$ to a set $\Gamma$ of Boolean rules by substituting $x \cup y$ with values from $D$ in all possible ways:

$$ \Gamma = \{ (q_{b_y}, w_{b_x}) \mid q_{b_y}() \text{ :- } \mathrm{body}(z/b) \text{ and } w_{b_x} = w(x/b_x) \} $$

where each $q_{b_y}()$ is a fresh symbol for distinct values of $b_y$, and $w_{b_x}$ is a real number. Rules created this way may have free variables in their bodies, e.g., q(x) :- R(x, y, z) with w(y) creates $|D|^2$ different rules of the form $q_a()$ :- R(a, b, z), one for each $(a, b) \in D^2$, and rules created with the same value of $b$ share the same weight. Tying weights allows one to create models succinctly.

Example 2.6. We use the following as an example:

Class(x) :- R(x, f) weight = w(f).

This declares a binary classifier as follows. Each binding for x is an object to classify as in Class or not. The relation R associates each object to its features. E.g., R(a, f) indicates that object a has a feature f. weight = w(f) indicates that weights are functions of feature f; thus, the same weights are tied across values for a. This rule declares a logistic regression classifier.

It is a straightforward formal extension to let weights be functions of the return values of UDFs, as we do in DeepDive.
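To see why the tied-weight rule behaves like logistic regression, one can work out the marginal of Class(a) by hand and check it in code. The sketch below is our own derivation under stated assumptions: every R(a, f) is fixed evidence, g is the linear semantics, and Class(a) appears in no other rule. Each feature f of a then contributes w(f) times the sign, so the total weight is plus or minus S, where S is the sum of w(f) over the features of a, and Pr[Class(a)] = e^S / (e^S + e^-S). The factor of 2 in the code comes from the +1/-1 sign convention.

```python
import math


def class_probability(features, weights):
    """Marginal probability of Class(a) for an object a with the given features.

    features: iterable of feature identifiers f such that R(a, f) holds.
    weights: dict from feature identifier to its tied weight w(f).
    """
    s = sum(weights.get(f, 0.0) for f in features)
    # e^s / (e^s + e^-s) rewritten as a logistic function of 2 * s.
    return 1.0 / (1.0 + math.exp(-2.0 * s))


# Hypothetical learned weights, shared by all objects that have the feature.
w = {"and his wife": 1.0, "and his brother": -1.5}

p_married = class_probability(["and his wife"], w)
p_sibling = class_probability(["and his brother"], w)
p_unknown = class_probability([], w)
```

An object with no features gets probability 0.5, and objects that share a feature share its weight, which is the weight tying described above.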

2.5 Inference on Factor Graphs 

As in Figure DeepDive explicitly constructs a factor 
graph for inference and learning using a set of SQL queries. 
Recall that a factor graph is a triple $(V, F, w)$ in which $V$ is
a set of nodes that correspond to Boolean random variables,
$F$ is a set of hyperedges (for $f \in F$, $f \subseteq V$), and
$w : F \times \{0,1\}^V \to \mathbb{R}$ is a weight function. We can identify possible
worlds with assignments since each node corresponds to a
tuple; moreover, in DeepDive, each hyperedge $f$ corresponds
to the set of groundings for a rule $\gamma$. In DeepDive, V and
F are explicitly created using a set of SQL queries. These 
data structures are then passed to the sampler, which runs 
outside the database, to estimate the marginal probability 
of each node or tuple in the database. Each tuple is then 
reloaded into the database with its marginal probability. 

Example 2.7. Take the database instances and rules in
the grounding figure as an example: each tuple in relations R, S, and Q
is a random variable, and V contains all random variables.
The inference rules F1 and F2 ground factors with the same
name in the factor graph, as illustrated in that figure. Both F1
and F2 are implemented as SQL in DeepDive.

To define the semantics, we use the equation for $w(\gamma, I)$ to define
$w(f, I) = w(\gamma, I)$, in which $\gamma$ is the rule corresponding to
$f$. As before, we define $W(F, I) = \sum_{f \in F} w(f, I)$, and then the
probability of a possible world is the following function:

$\Pr[I] = Z^{-1} \exp\{W(F, I)\}$, where $Z = \sum_{I \in \mathcal{I}} \exp\{W(F, I)\}$

The main task that DeepDive conducts on factor graphs 
is statistical inference, i.e., for a given node, what is the 
marginal probability that this node takes the value 1? Since 
a node takes value 1 when a tuple is in the output, this 
process computes the marginal probability values returned 
to users. In general, computing these marginal probabilities
is #P-hard. Like many other systems, DeepDive uses
Gibbs sampling to estimate the marginal probability of
every tuple in the database.
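As an illustration of this inference task, the sketch below runs Gibbs sampling on a tiny factor graph and compares the estimated marginals with brute-force enumeration of all possible worlds. It is a minimal sketch, not DeepDive's sampler: the representation of a factor as a pair (variables, weight function), the three example factors, and the burn-in length are assumptions made for illustration.

```python
import math
import random
from itertools import product

# A factor is (variables, weight_fn); weight_fn maps the 0/1 values of those
# variables to a real weight. Pr[I] is proportional to exp(sum of weights).
def total_weight(factors, world):
    return sum(fn(tuple(world[v] for v in vs)) for vs, fn in factors)

def exact_marginals(n, factors):
    """Pr[v = 1] by enumerating all 2^n worlds (feasible only for tiny n)."""
    z, mass = 0.0, [0.0] * n
    for world in product((0, 1), repeat=n):
        p = math.exp(total_weight(factors, world))
        z += p
        for v in range(n):
            if world[v]:
                mass[v] += p
    return [m / z for m in mass]

def gibbs_marginals(n, factors, sweeps, rng, burn_in=100):
    """Estimate Pr[v = 1]; each update reads only the factors touching v."""
    touching = [[f for f in factors if v in f[0]] for v in range(n)]
    world = [rng.randint(0, 1) for _ in range(n)]
    counts = [0] * n
    for sweep in range(burn_in + sweeps):
        for v in range(n):
            world[v] = 1
            w1 = total_weight(touching[v], world)
            world[v] = 0
            w0 = total_weight(touching[v], world)
            p1 = 1.0 / (1.0 + math.exp(w0 - w1))  # Pr[v = 1 | rest]
            world[v] = 1 if rng.random() < p1 else 0
        if sweep >= burn_in:
            for v in range(n):
                counts[v] += world[v]
    return [c / sweeps for c in counts]

factors = [  # hypothetical example graph on variables 0, 1, 2
    ((0,), lambda x: 0.8 * x[0]),
    ((0, 1), lambda x: 1.0 if x[0] == x[1] else 0.0),
    ((1, 2), lambda x: -0.5 * x[0] * x[1]),
]
exact = exact_marginals(3, factors)
approx = gibbs_marginals(3, factors, sweeps=20000, rng=random.Random(0))
```

The cost of one Gibbs update is determined by the number of factors touching the variable, which is why the later sections focus on reducing the number of factors that must be fetched.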

3. INCREMENTAL KBC 

To help the KBC system developer be more efficient, we
developed techniques to incrementally perform the grounding
and inference steps of KBC execution.

Problem Setting. Our approach to incrementally maintaining
a KBC system runs in two phases. (1) Incremental
Grounding. The goal of the incremental grounding
phase is to evaluate an update of the DeepDive program
to produce the “delta” of the modified factor graph, i.e.,
the modified variables $\Delta V$ and factors $\Delta F$. This phase consists
of relational operations, and we apply classic incremental
view maintenance techniques. (2) Incremental Inference.
The goal of incremental inference is, given $(\Delta V, \Delta F)$,
to run statistical inference on the changed factor graph.

3.1 Standard Techniques: Delta Rules 

Because DeepDive is based on SQL, we are able to take
advantage of decades of work on incremental view maintenance.
The input to this phase is the same as the input
to the grounding phase: a set of SQL queries and the
user schema. The output of this phase is how the output
of grounding changes, i.e., a set of modified variables $\Delta V$
and their factors $\Delta F$. Since $V$ and $F$ are simply views over
the database, any view maintenance technique can be applied
to incremental grounding. DeepDive uses the DRed algorithm
[21], which handles both additions and deletions. Recall
that in DRed, for each relation $R_i$ in the user’s schema,
we create a delta relation, $R_i^{\delta}$, with the same schema as $R_i$
and an additional column count. For each tuple $t$, $t$.count
represents the number of derivations of $t$ in $R_i$. On an update,
DeepDive updates delta relations in two steps. First,
for tuples in $R_i^{\delta}$, DeepDive directly updates the corresponding
counts. Second, a SQL query called a “delta rule” is
executed, which processes these counts to generate the modified
variables $\Delta V$ and factors $\Delta F$. We found that the overhead
of DRed is modest and the gains may be substantial, so
DeepDive always runs DRed, except on initial load.
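The role of the count column can be illustrated with a small in-memory sketch. This is a count-based simplification in the spirit of the description above, not DRed itself and not DeepDive's SQL implementation; the class name `DeltaView` and the view q(x) :- R(x, y) are assumptions chosen for illustration.

```python
from collections import Counter

class DeltaView:
    """Maintains the view q(x) :- R(x, y) with one derivation count per x."""

    def __init__(self):
        self.r = set()          # current contents of R
        self.count = Counter()  # number of derivations of each tuple of q

    def apply(self, inserts=(), deletes=()):
        """Apply an update to R and return (added, removed) tuples of q."""
        added, removed = set(), set()
        for x, y in deletes:
            if (x, y) in self.r:
                self.r.remove((x, y))
                self.count[x] -= 1
                if self.count[x] == 0:   # last derivation disappeared
                    del self.count[x]
                    removed.add(x)
        for x, y in inserts:
            if (x, y) not in self.r:
                self.r.add((x, y))
                self.count[x] += 1
                if self.count[x] == 1:   # first derivation appeared
                    added.add(x)
        # A tuple removed and re-derived within one update is unchanged.
        return added - removed, removed - added

view = DeltaView()
first = view.apply(inserts=[("a", 1), ("a", 2), ("b", 1)])
second = view.apply(deletes=[("a", 1)])
third = view.apply(inserts=[("c", 3)], deletes=[("a", 2)])
```

Only the second and third calls are incremental updates in the sense of this section: deleting one of two derivations of a changes a count but produces no modified variable, whereas deleting the last derivation does.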

3.2 Novel Techniques for Incremental Main¬ 
tenance of Inference 

We present three techniques for the incremental inference
phase on factor graphs: given the set of modified variables
$\Delta V$ and modified factors $\Delta F$ produced in the incremental
grounding phase, our goal is to compute the new distribution.
We split the problem into two phases. In the materialization
phase, we are given access to the entire DeepDive
program, and we attempt to store information about
the original distribution, denoted $\Pr^{(0)}$. Each approach
stores different information to use in the next phase, called
the inference phase. The input to the inference phase is the
materialized data from the preceding phase and the changes
made to the factor graph, i.e., the modified variables $\Delta V$ and
factors $\Delta F$. Our goal is to perform inference with respect to
the changed distribution, denoted $\Pr^{(\Delta)}$. For each approach,
we study its space and time costs for materialization and the
time cost for inference. We also analyze the empirical tradeoff
between the approaches in Section 3.2.4.

Footnote (delta rules, Section 3.1): For example, for the grounding
procedure illustrated in the figure of Example 2.7, the delta rule
for F1 is $q^{\delta}(x) \text{ :- } R^{\delta}(x, y)$.

3.2.1 Strawman: Complete Materialization 

The strawman approach, complete materialization, is com¬ 
putationally expensive and often infeasible. We use it to set 
a baseline for other approaches. 

Materialization Phase. We explicitly store the value of
the probability Pr[I] for every possible world I. This approach
has perfect fidelity, but storing all possible worlds
takes an exponential amount of space and time in the number
of variables in the original factor graph. Thus, the
strawman approach is often infeasible on even moderate-sized
graphs.

Inference Phase. We use Gibbs sampling: even if the
distribution has changed to $\Pr^{(\Delta)}$, we only need access to
the new factors in $\Delta F$ and to Pr[I] to perform the Gibbs
update. The speed improvement arises from the fact that
we do not need to access all factors from the original graph
and perform a computation with them, since we can look
them up in Pr[I].

3.2.2 Sampling Approach 

The sampling approach is a standard technique to improve
over the strawman approach by storing a set of possible
worlds sampled from the original distribution instead of
storing all possible worlds. However, as the updated distribution
$\Pr^{(\Delta)}$ is different from the distribution used to draw
the stored samples, we cannot reuse them directly. We use
a (standard) Metropolis-Hastings scheme to ensure convergence
to the updated distribution.

Materialization Phase. In the materialization phase, we
store a set of possible worlds drawn from the original distribution.
For each variable, we store the set of samples as a
tuple bundle, as in MCDB. A single sample for one random
variable requires only 1 bit of storage. Therefore, the
sampling approach can be efficient in terms of materialization
space. In the KBC systems we evaluated, 100 samples
require less than 5% of the space of the original factor graph.

Footnote (strawman, Section 3.2.1): Compared with running inference
from scratch, the strawman approach does not materialize any factors.
Therefore, it is necessary for the strawman to enumerate each possible
world and save its probability, because we do not know a priori
which possible world will be used in the inference phase.

Algorithm 1 Variational Approach (Materialization)

Input: Factor graph $FG = (V, F)$, regularization parameter $\lambda$, number
of samples $N$ for approximation.

Output: An approximated factor graph $FG' = (V, F')$

1: $I_1, \ldots, I_N \leftarrow$ $N$ samples drawn from $FG$.
2: $NZ \leftarrow \{(v_i, v_j) : v_i \text{ and } v_j \text{ are in some factor in } FG\}$.
3: $M \leftarrow$ covariance matrix estimated using $I_1, \ldots, I_N$, such that $M_{ij}$
   is the covariance between variable $i$ and variable $j$. Set $M_{ij} = 0$
   if $(v_i, v_j) \notin NZ$.
4: Solve the following optimization problem using gradient descent
   and let the result be $\hat{X}$:
       $\arg\max_X \log\det X$
       s.t. $X_{kk} = M_{kk} + 1/3$,
            $|X_{kj} - M_{kj}| \le \lambda$,
            $X_{kj} = 0$ if $(v_k, v_j) \notin NZ$
5: for all $i, j$ s.t. $\hat{X}_{ij} \neq 0$ do
6:    Add in $F'$ a factor from $(v_i, v_j)$ with weight $\hat{X}_{ij}$.
7: end for
8: return $FG' = (V, F')$.

Inference Phase. We use the samples to generate proposals
and adapt them to estimate the up-to-date distribution.
This idea of using samples from similar distributions as proposals
is standard in statistics, e.g., importance sampling,
rejection sampling, and different variants of Metropolis-Hastings
methods. After investigating these approaches, in DeepDive
we use the independent Metropolis-Hastings approach
[40], which generates proposal samples and accepts these
samples with an acceptance test. We choose this method
only because the acceptance test can be evaluated using the
sample, $\Delta V$, and $\Delta F$, without the entire factor graph. Thus,
we may fetch many fewer factors than in the original graph,
but we still converge to the correct answer.

The fraction of accepted samples is called the acceptance
rate, and it is a key parameter in the efficiency of this approach.
The approach may exhaust the stored samples, in
which case the method resorts to another evaluation method
or generates fresh samples.
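The acceptance test of the sampling approach (Section 3.2.2) can be sketched as follows. This is a minimal illustration, not DeepDive's implementation: it assumes the update only adds factors over existing variables (no new variables in $\Delta V$), and the example weights and the function name `independent_mh` are hypothetical. Because the proposals are drawn from the original distribution and the updated distribution equals the original one multiplied by the exponentiated weight of the new factors, the Metropolis-Hastings ratio reduces to a difference of new-factor weights.

```python
import math
import random
from itertools import product

def independent_mh(stored_samples, delta_weight, rng):
    """Adapt samples of the original distribution to the updated one.

    delta_weight(world) is the total weight of the new factors only, so the
    acceptance test never reads the factors of the original graph.
    Returns the resulting chain and the observed acceptance rate.
    """
    current = stored_samples[0]
    chain, accepted = [], 0
    for proposal in stored_samples[1:]:
        log_ratio = delta_weight(proposal) - delta_weight(current)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
            accepted += 1
        chain.append(current)
    return chain, accepted / len(chain)

rng = random.Random(1)
worlds = list(product((0, 1), repeat=3))
old_weight = lambda w: 1.0 * (w[0] == w[1]) + 0.5 * w[2]   # original graph
delta_weight = lambda w: 1.2 * w[0] * w[2]                 # new factor

# Materialization: exact samples from the original distribution (tiny graph).
stored = rng.choices(worlds, weights=[math.exp(old_weight(w)) for w in worlds], k=20000)

# Inference: reuse the stored samples as proposals.
chain, acceptance_rate = independent_mh(stored, delta_weight, rng)
estimate = [sum(w[v] for w in chain) / len(chain) for v in range(3)]

# Reference marginals of the updated distribution, by enumeration.
new_mass = [math.exp(old_weight(w) + delta_weight(w)) for w in worlds]
exact = [sum(m for w, m in zip(worlds, new_mass) if w[v]) / sum(new_mass) for v in range(3)]
```

When the new factors have zero weight, every proposal is accepted, which corresponds to the 100% acceptance rate case discussed in the experiments; larger changes lower the acceptance rate and consume the stored samples faster.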

3.2.3 Variational Approach 

The intuition behind our variational approach is as fol¬ 
lows: rather than storing the exact original distribution, we 
store a factor graph with fewer factors that approximates 
the original distribution. On the smaller graph, running in¬ 
ference and learning is often faster. 

Materialization Phase. The key idea of the variational
approach is to approximate the distribution using simpler or
sparser correlations. To learn a sparser model, we use Algorithm 1,
which is a log-determinant relaxation [43] with an $\ell_1$
penalty term. Our aim, which is novel, is to understand its
strengths and limitations on KBC problems. This approach
uses standard techniques for learning that are already implemented
in DeepDive.

The input is the original factor graph and two parameters:
the number of samples $N$ to use for approximating
the covariance matrix, and the regularization parameter $\lambda$,
which controls the sparsity of the approximation. The output
is a new factor graph that has only binary potentials.
The intuition for this procedure comes from graphical model
structure learning: an entry $(i, j)$ is present in the inverse
covariance matrix only if variables $i$ and $j$ are connected in the
factor graph. Given these inputs, the algorithm first draws
a set of $N$ possible worlds by running Gibbs sampling on the
original factor graph. It then estimates the covariance matrix
based on these samples (Lines 1-3). Using the estimated
covariance matrix, the algorithm solves the optimization
problem in Line 4 to estimate the inverse covariance matrix
$\hat{X}$. Then, the algorithm creates one factor for each pair of
variables such that the corresponding entry in $\hat{X}$ is non-zero,
using the value in $\hat{X}$ as the new weight (Lines 5-7). These are
all the factors of the approximated factor graph (Line 8).

Inference Phase. Given an update to the factor graph
(e.g., new variables or new factors), we simply apply this
update to the approximated graph, and run inference and
learning directly on the resulting factor graph. As shown in
Figure 5(c), the execution time of the variational approach
is roughly linear in the sparsity of the approximated factor
graph. Indeed, the execution time of running statistical
inference using Gibbs sampling is dominated by the time 
needed to fetch the factors for each random variable, which is 
an expensive operation requiring random access. Therefore, 
as the approximated graph becomes sparser, the number of 
factors decreases and so does the running time. 

Parameter Tuning. We are among the first to use these
methods in KBC applications, and there is little literature
about tuning $\lambda$. Intuitively, the smaller $\lambda$ is, the better the
approximation is, but the less sparse the approximation is.
To understand the impact of $\lambda$ on quality, we show in Figure 6
the quality (F1 score) of a DeepDive program called
News (see Section 4) as we vary the regularization parameter.
As long as the regularization parameter $\lambda$ is small (e.g.,
less than 0.1), the quality does not change significantly. In
all of our applications we observe that there is a relatively
large “safe” region from which to choose $\lambda$. In fact, for all
five systems in Section 4, even if we set $\lambda$ at 0.1 or 0.01, the
impact on quality is minimal (within 1%), while the impact
on speed is significant (up to an order of magnitude). Based
on Figure 6, DeepDive supports a simple search protocol to
set $\lambda$. We start with a small $\lambda$, e.g., 0.001, and increase it by
10× until the KL-divergence is larger than a user-specified
threshold, specified as a parameter in DeepDive.
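The search protocol can be sketched as a short loop. The function name `choose_lambda`, the `max_steps` guard, and the choice to return the last value that stayed within the threshold are assumptions made for illustration; the text only specifies the starting point, the 10× increase, and the stopping condition. The KL-divergence function used in the example is a hypothetical stand-in for the measurement DeepDive would make.

```python
def choose_lambda(kl_divergence, threshold, start=0.001, factor=10.0, max_steps=8):
    """Return the largest tried lambda whose KL-divergence stays within
    the threshold, or None if even the starting value exceeds it."""
    lam, best = start, None
    for _ in range(max_steps):
        if kl_divergence(lam) > threshold:
            break
        best = lam
        lam *= factor
    return best

# Hypothetical divergence that grows with the regularization parameter.
hypothetical_kl = lambda lam: lam ** 2

chosen = choose_lambda(hypothetical_kl, threshold=0.02)
rejected = choose_lambda(lambda lam: 1.0, threshold=0.5)
```

Growing $\lambda$ geometrically keeps the number of materializations small, which matters because each candidate value requires solving the optimization problem of Algorithm 1.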

3.2.4 Tradeoffs 

We studied the tradeoff between different approaches and
summarize the empirical results of our study in Figure 5.
The performance of different approaches may differ by more
than two orders of magnitude, and no approach dominates
the others. We use a synthetic factor graph with pairwise
factors and control the following axes:

(1) Number of variables in the factor graph. In our
experiments, we set the number of variables to values
in {2, 10, 17, 100, 1000, 10000}.

(2) Amount of change. How much the distribution
changes affects efficiency, which manifests itself in the
acceptance rate: the smaller the acceptance rate is,
the larger the difference between the distributions.
We set the acceptance rate to values in {1.0, 0.5, 0.1,
0.01}.

(3) Sparsity of correlations. This is the ratio between
the number of non-zero weights and the total number of weights.
We set the sparsity to values in {0.1, 0.2, 0.3, 0.4, 0.5, 1.0}
by selecting uniformly at random a subset of factors
and setting their weights to zero.

We now discuss the results of our exploration of the tradeoff
space, presented in Figure 5(a-c).


Footnote (synthetic factor graph): In Figure 5, the numbers are reported for a factor graph
whose factor weights are sampled at random from [-0.5, 0.5].
We also experimented with different intervals ([-0.1, 0.1],
[-1, 1], [-10, 10]), but these had no impact on the tradeoff.

Figure 5: A summary of the tradeoffs. Left: An analytical cost model for different approaches; Right:
Empirical examples that illustrate the tradeoff space. All converge to <0.1% loss, and thus have comparable
quality.


Size of the Factor Graph. Since the materialization cost of
the strawman is exponential in the size of the factor graph, 
we observe that, for graphs with more than 20 variables, the 
strawman is significantly slower than either the sampling 
approach or the variational approach. Factor graphs arising 
from KBC systems usually contain a much larger number of 
variables; therefore, from now on, we focus on the tradeoff 
between sampling and variational approaches. 

Amount of Change. As shown in Figure 5(b), when the
acceptance rate is high, the sampling approach can outperform
the variational one by more than two orders of magnitude.
When the acceptance rate is high, the sampling approach
requires no computation and so is much faster than
Gibbs sampling. In contrast, when the acceptance rate is
low, e.g., 0.1%, the variational approach can be more than
5× faster than the sampling approach. An acceptance rate
lower than 0.1% occurs in KBC operations when one updates
the training data, adds many new features, or when concept
drift happens during the development of KBC systems.

Sparsity of Correlations. As shown in Figure 5(c), when
the original factor graph is sparse, the variational approach
can be 11× faster than the sampling approach. This is because
the approximate factor graph contains fewer than 10%
of the factors of the original graph, and it is therefore
much faster to run inference on the approximate graph. On
the other hand, if the original factor graph is too dense, the
variational approach can be more than 7× slower than the
sampling one, as it is essentially performing inference on a
factor graph with a size similar to that of the original graph.

Discussion: Theoretical Guarantees. We discuss the
theoretical guarantee that each materialization strategy provides.
Each materialization method inherits the guarantee
of its underlying inference technique. The strawman approach
retains the same guarantees as Gibbs sampling. The sampling
approach uses a standard Metropolis-Hastings scheme; given
enough time, this approach will converge to the true distribution.
For the variational approach, the guarantees are
more subtle, and we point the reader to the consistency of
structure estimation of Gaussian Markov random fields
and the log-determinant relaxation. These results are theoretically
incomparable, motivating our empirical study.



Figure 6: Quality and number of factors of the News
corpus with different regularization parameters for
the variational approach. (Plot omitted; x-axis: different
regularization parameters.)

3.3 Choosing Between Different Approaches 

From the study of the tradeoff space, neither the sampling
approach nor the variational approach dominates the
other, and their relative performance depends on how they
are being used in KBC. We propose to materialize the factor
graph using both the sampling approach and the variational
approach, and defer the decision to the inference phase when
we can observe the workload.

Materialization Phase. Both approaches need samples
from the original factor graph, and this is the dominant
cost during materialization. A key question is “How many
samples should we collect?” We experimented with several
heuristic methods to estimate the number of samples
that are needed, which requires understanding how likely
future changes are, statistical considerations, etc. These approaches
were difficult for users to understand, so DeepDive
takes a best-effort approach: it generates as many samples
as possible when idle or within a user-specified time interval.

Inference Phase. Based on the tradeoff analysis, we developed
a rule-based optimizer with the following set of rules:

• If an update does not change the structure of the 
graph, choose the sampling approach. 

• If an update modifies the evidence, choose the varia¬ 
tional approach. 

• If an update introduces new features, choose the sam¬ 
pling approach. 

• Finally, if we run out of samples, use the variational 
approach. 
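The four rules can be written as a small decision function. The sketch below is one possible reading, not DeepDive's optimizer: the `Update` fields, the function name `choose_strategy`, the precedence given to evidence changes over the structure rule, and the fallback for updates that match no rule are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Update:
    changes_structure: bool   # the update adds or removes variables/factors
    modifies_evidence: bool   # the update changes training data
    adds_features: bool       # the update introduces new features

def choose_strategy(update, samples_left):
    """Pick 'sampling' or 'variational' for one update."""
    if update.modifies_evidence:
        choice = "variational"
    elif not update.changes_structure or update.adds_features:
        choice = "sampling"
    else:
        choice = "variational"   # assumption: default for uncovered updates
    if choice == "sampling" and samples_left == 0:
        choice = "variational"   # out of stored samples
    return choice

analysis = choose_strategy(Update(False, False, False), samples_left=1000)
supervision = choose_strategy(Update(False, True, False), samples_left=1000)
features = choose_strategy(Update(True, False, True), samples_left=1000)
exhausted = choose_strategy(Update(True, False, True), samples_left=0)
```

Checking the evidence rule first reflects the lesion study in Section 4.3, where supervision rules benefit most from the variational approach because new training examples lower the acceptance rate.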

This simple set of rules is used in our experiments.

System         # Docs   # Rels   # Rules   # Vars   # Factors
Adversarial    5M       1        10        0.1B     0.4B
News           1.8M     34       22        0.2B     1.2B
Genomics       0.2M     3        15        0.02B    0.1B
Pharma.        0.6M     9        24        0.2B     1.2B
Paleontology   0.3M     8        29        0.3B     0.4B

Figure 7: Statistics of KBC systems we used in experiments.
The # vars and # factors are for factor
graphs that contain all rules.


Rule   Description
A1     Calculate marginal probability for variables or variable pairs.
FE1    Shallow NLP features (e.g., word sequence)
FE2    Deeper NLP features (e.g., dependency path)
I1     Inference rules (e.g., symmetrical HasSpouse).
S1     Positive examples
S2     Negative examples

Figure 8: The set of rules in News. See Section 4.1.

4. EXPERIMENTS 

We conducted an experimental evaluation of DeepDive for 
incremental maintenance of KBC systems. 

4.1 Experimental Settings 

To evaluate DeepDive, we used DeepDive programs de¬ 
veloped by our users over the last three years from pale¬ 
ontologists, geologists, biologists, a defense contractor, and 
a KBC competition. These are high-quality KBC systems: 
two of our KBC systems for natural sciences achieved quality 
comparable to (and sometimes better than) human experts, 
as assessed by double-blind experiments, and our KBC sys¬ 
tem for a KBC competition is the top system among all 
45 submissions from 18 teams as assessed by professional 
annotators. To simulate the development process, we took
snapshots of DeepDive programs at the end of every development
iteration, and we use this dataset of snapshots in the
experiments to test our hypothesis that incremental
techniques can be used to improve development speed.

Datasets and Workloads. To study the efficiency of DeepDive,
we selected five KBC systems, namely (1) News, (2)
Genomics, (3) Adversarial, (4) Pharmacogenomics, and (5)
Paleontology. Their names refer to the specific domains
on which they focus. Figure 7 shows the statistics of
these KBC systems and of their input datasets. We group
all rules in each system into six rule templates with four
workload categories. We focus on the News system below.

The News system builds a knowledge base between persons,
locations, and organizations, and contains 34 different
relations, e.g., HasSpouse or MemberOf. The input to the
KBC system is a corpus that contains 1.8 million news articles
and Web pages. We use four types of rules in News
in our experiments, as shown in Figure 8: error analysis
(rule A1), candidate generation and feature extraction (FE1,
FE2), supervision (S1, S2), and inference (I1), corresponding
to the steps where these rules are used.

Other applications are different in terms of the quality
of the text. We chose these systems because they span a large
range in the spectrum of quality: Adversarial contains advertisements
collected from websites where each document
may have only 1-2 sentences with grammatical errors; in
contrast, Paleontology contains well-curated journal articles
with precise, unambiguous writing and simple relationships.
Genomics and Pharma have precise texts, but the goal is to
extract relationships that are more linguistically ambiguous
compared to the Paleontology text. News has slightly degraded
writing and ambiguous relationships, e.g., “member
of.” Rules with the same prefix, e.g., FE1 and FE2, belong
to the same category, e.g., feature extraction.

          Adv    News   Gen    Pha    Pale
Linear    0.72   0.32   0.47   0.52   0.74
Logical   0.72   0.34   0.53   0.56   0.80
Ratio     0.72   0.34   0.53   0.57   0.81

Figure 10: (a) Quality improvement over time (plot omitted;
x-axis: total execution time in seconds); (b) Quality (F1)
for different semantics (table above).

DeepDive Details. DeepDive is implemented in Scala and
C++, and we use Greenplum to handle all SQL. All feature
extractors are written in Python. The statistical inference
and learning and the incremental maintenance components
are all written in C++. All experiments are run
on a machine with four CPUs (each CPU is a 12-core 2.40
GHz Xeon E5-4657L), 1 TB RAM, and 12 × 1 TB hard drives,
running Ubuntu 12.04. For these experiments, we compiled
DeepDive with Scala 2.11.2, g++-4.9.0 with -O3 optimization,
and Python 2.7.3. In Genomics and Adversarial,
Python 3.4.0 is used for feature extractors.

4.2 End-to-end Performance and Quality 

We built a modified version of DeepDive called Rerun,
which, given an update to the KBC system, runs the DeepDive
program from scratch. DeepDive, which uses all techniques,
is called Incremental. The results of our evaluation
show that DeepDive is able to speed up the development
of high-quality KBC systems through incremental mainte¬ 
nance with little impact on quality. We set the number of 
samples to collect during execution to {10, 100, 1000} and 
the number of samples to collect during materialization to 
{1000, 2000}. We report results for (1000,2000), as results 
for other combinations of parameters are similar. 

Quality Over Time. We first compare Rerun and Incremental
in terms of the wait time that developers experience
to improve the quality of a KBC system. We focus on News
because it is a well-known benchmark competition. We run
all six rules sequentially for both Rerun and Incremental,
and after executing each rule, we report the quality of the
system measured by the F1 score and the cumulative execution
time. Materialization in the Incremental system is
performed only once. Figure 10(a) shows the results. Using
Incremental takes significantly less time than Rerun to
achieve the same quality. To achieve an F1 score of 0.36
(a competition-winning score), Incremental is 22× faster
than Rerun. Indeed, each run of Rerun takes 6 hours,
while a run of Incremental takes at most 30 minutes.

We further compare the facts extracted by Incremental
and Rerun and find that these two systems not only have
similar end-to-end quality, but are also similar enough to
support common debugging tasks. Of the facts with high
confidence in Rerun (> 0.9 probability), 99% also appear
in Incremental, and vice versa. High-confidence
extractions are used by the developer to debug
precision issues. Among all facts, we find that at most 4%
have a probability that differs by more than 0.05.
The similarity between snapshots suggests that our incremental
maintenance techniques can be used for debugging.

        Adversarial          News                 Genomics             Pharma.              Paleontology
Rule    Rerun  Inc.   ×      Rerun  Inc.   ×      Rerun  Inc.   ×      Rerun  Inc.   ×      Rerun  Inc.   ×
A1      1.0    0.03   33×    2.2    0.02   112×   0.3    0.01   30×    3.6    0.11   33×    2.8    0.3    10×
FE1     1.1    ...

Figure 9: End-to-end efficiency of incremental inference and learning. All execution times are in hours. The
column × refers to the speedup of Incremental (Inc.) over Rerun.
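The two similarity measures used in this comparison can be computed as in the sketch below. The function name, the representation of a snapshot as a mapping from fact to marginal probability, and the reading of "also appear" as "also has probability above 0.9" are assumptions; the four example facts are hypothetical and are not taken from the experiments.

```python
def compare_snapshots(rerun, incremental, high=0.9, tolerance=0.05):
    """Return (overlap of high-confidence facts, fraction of drifted facts).

    rerun and incremental map each fact to its marginal probability.
    """
    confident = [f for f, p in rerun.items() if p > high]
    kept = sum(1 for f in confident if incremental.get(f, 0.0) > high)
    facts = set(rerun) | set(incremental)
    drifted = sum(
        1 for f in facts
        if abs(rerun.get(f, 0.0) - incremental.get(f, 0.0)) > tolerance
    )
    overlap = kept / len(confident) if confident else 1.0
    return overlap, drifted / len(facts)

# Hypothetical marginals for four facts.
rerun = {"f1": 0.95, "f2": 0.92, "f3": 0.40, "f4": 0.10}
incremental = {"f1": 0.97, "f2": 0.85, "f3": 0.42, "f4": 0.11}
overlap, drift = compare_snapshots(rerun, incremental)
```

In the toy data, one of the two high-confidence facts drops below 0.9 and one of the four facts moves by more than 0.05, so the function reports an overlap of 0.5 and a drift of 0.25; the experiments above report 99% and at most 4% for the corresponding quantities.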

Efficiency of Evaluating Updates. We now compare Rerun
and Incremental in terms of their speed in evaluating
a given update to the KBC system. To better understand
the impact of our technical contribution, we divide the total
execution time into two parts: (1) the time used for feature
extraction and grounding; and (2) the time used for statistical
inference and learning. We implemented classical incremental
materialization techniques for feature extraction
and grounding, which achieve up to a 360× speedup for rule
FE1 in News. We get this speedup for free using standard
RDBMS techniques, a key design decision in DeepDive.

Figure 9 shows the execution time of statistical inference
and learning for each update on different systems. We see
from Figure 9 that Incremental achieves a 7× to 112×
speedup for News across all categories of rules. The analysis
rule A1 achieves the highest speedup. This is not surprising
because, after applying A1, we do not need to rerun
statistical learning, and the updated distribution does not
change compared with the original distribution, so the sampling
approach has a 100% acceptance rate. The execution
of rules for feature extraction (FE1, FE2), supervision (S1,
S2), and inference (I1) has a 10× speedup. For these rules,
the speedup over Rerun is attributable to the fact that
the materialized graph contains only 10% of the factors in
the full original graph. Below, we show that both the sampling
approach and the variational approach contribute to the
speedup. Compared with A1, the speedup is smaller because
these rules produce a factor graph whose distribution
changes more than it does for A1. Because the difference in
distribution is larger, the benefit of incremental evaluation is lower.

The execution of other KBC applications showed similar
speedups, but there are also several interesting data points.
For Pharmacogenomics, rule I1 speeds up only 3×. This
is because I1 introduces many new factors,
and the new factor graph is 1.4× larger than the original
one. In this case, DeepDive needs to evaluate those new
factors, which is expensive. For Paleontology, we see that
the analysis rule A1 gets only a 10× speedup because, as illustrated
in the corpus statistics (Figure 7), the Paleontology
factor graph has fewer factors for each variable than other
systems. Therefore, executing inference on the whole factor
graph is cheaper.

Figure 11: Study of the tradeoff space on News. (Plot omitted.)

Materialization Time. One factor that we need to consider
is the materialization time for Incremental. Incremental
took 12 hours to complete the materialization (2000
samples) for each of the five systems. Most of this time is
spent in getting 2× more samples than for a single run of
Rerun. We argue that paying this cost is worthwhile given
that it is a one-time cost and the materialization can be used
for many successive updates, amortizing the one-time cost.
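The amortization argument can be made concrete with a back-of-the-envelope calculation. The sketch below combines the 12-hour materialization time with the per-run times reported for News in Section 4.2 (6 hours for Rerun, at most 30 minutes for Incremental); combining these figures in this way is an assumption made for illustration, and the function name is hypothetical.

```python
import math

def break_even_updates(materialization_h, rerun_h, incremental_h):
    """Smallest number of updates after which materializing once and then
    running incrementally is strictly cheaper than rerunning every time."""
    saved_per_update = rerun_h - incremental_h
    return math.floor(materialization_h / saved_per_update) + 1

updates = break_even_updates(materialization_h=12.0, rerun_h=6.0, incremental_h=0.5)
cost_rerun = updates * 6.0
cost_incremental = 12.0 + updates * 0.5
```

Under these assumed numbers the one-time cost is recovered by the third update, which is small compared with the number of iterations in a typical development loop.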

4.3 Lesion Studies 

We conducted lesion studies to verify the effect of the 
tradeoff space on the performance of DeepDive. In each 
lesion study, we disable a component of DeepDive, and leave 
all other components untouched. We report the execution 
time for statistical inference and learning. 

We evaluate the impact of each materialization strategy
on the final end-to-end performance. We disabled either the
sampling approach or the variational approach and left all
other components of the system untouched. Figure 11 shows
the results for News. Disabling either the sampling approach
or the variational approach slows down the execution compared
to the “full” system. For analysis rule A1, disabling
the sampling approach leads to a more than 11× slowdown,
because for this rule the distribution does not change and the
sampling approach therefore has a 100% acceptance rate. For
feature extraction rules, disabling the sampling approach
slows down the system by 5× because it forces the use of
the variational approach even when the distribution for a
group of variables does not change. For supervision rules,
disabling the variational approach makes the system 36× slower
because the introduction of training examples decreases the
acceptance rate of the sampling approach.

Optimizer. Using different materialization strategies for different
groups of variables positively affects the performance
of DeepDive. We compare Incremental with a strong
baseline, NoWorkloadInfo, which, for each group, first
runs the sampling approach and, after all samples have been
used, switches to the variational approach. Note that this
baseline is stronger than the strategy that fixes the same
strategy for all groups. Figure 11 shows the results of the
experiment. We see that with the ability to choose between
the sampling approach and the variational approach according
to the workload, DeepDive can be up to 2× faster than
NoWorkloadInfo.

5. CONCLUSION

We described the DeepDive approach to KBC and our ex¬ 
perience building KBC systems over the last few years. To 
improve quality, we argued that a key challenge is to ac¬ 
celerate the development loop. We described the semantic 
choices that we made in our language. By building on SQL, 
DeepDive is able to use classical techniques to provide incre¬ 
mental processing for the SQL components. However, these 
classical techniques do not help with statistical inference, 
and we described a novel tradeoff space for approximate in¬ 
ference techniques. We used these approximate inference 
techniques to improve end-to-end execution time in the face 
of changes both to the program and the data; they improved 
system performance by two orders of magnitude in five real 
KBC scenarios while keeping the quality high enough to aid 
in the development process.

APPENDIX

A. ADDITIONAL THEORETICAL DETAILS 

We find that the three semantics that we defined, namely,
Linear, Logical, and Ratio, affect the convergence
speed of the sampling procedure. We describe these results
in Section A.1 and provide proofs in Section A.2.

A.1 Convergence Results

We describe our results on convergence rates, which are
summarized in Figure 12. We first describe results for the
Voting program from Example 2.5 and summarize those
results in a theorem. We now introduce the standard metric
of convergence in Markov chain theory.

Definition A.1. The total variation distance between two
probability measures $P$ and $P'$ over the same sample space
$\Omega$ is defined as

$$\|P - P'\|_{tv} = \sup_{A \subset \Omega} |P(A) - P'(A)|,$$

that is, it represents the largest difference in the probabilities
that $P$ and $P'$ assign to the same event.
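
As a small illustration of Definition A.1, the following sketch computes the total variation distance for distributions over a finite sample space. It relies on the standard fact that, in the finite case, the supremum over events equals half the L1 distance between the probability vectors; the function names and the dictionary representation are assumptions made for this example only.

```python
from itertools import chain, combinations


def total_variation(p, q):
    """TV distance between two finite distributions (dict: outcome -> prob).

    Uses the identity sup_A |P(A) - Q(A)| = 0.5 * sum_x |p(x) - q(x)|.
    """
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in outcomes)


def total_variation_sup(p, q):
    """Brute-force version of the definition: maximize over every event A.

    Exponential in the number of outcomes, so only usable on tiny spaces.
    """
    outcomes = sorted(set(p) | set(q))
    events = chain.from_iterable(
        combinations(outcomes, r) for r in range(len(outcomes) + 1)
    )
    return max(
        abs(sum(p.get(x, 0.0) for x in event) - sum(q.get(x, 0.0) for x in event))
        for event in events
    )
```

On a two-outcome space, both functions agree, which is a quick sanity check of the identity used in the first one.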

In Gibbs sampling, the total variation distance is a metric of
comparison between the actual distribution $P_k$ achieved at some
timestep $k$ and the equilibrium distribution $\pi$.

Types of Programs. We consider two families of programs motivated
by our work in KBC: voting programs and hierarchical programs. The
voting program is used to continue the example, but some KBC
programs are hierarchical (13/14 KBC systems from the literature).
We consider some enhancements that make our programs more
realistic. We allow every tuple (IDB or EDB) to have its own
weight, i.e., we allow every ground atom $R(a)$ to have a rule
$w_a : R(a)$. Moreover, we allow any atom to be marked as evidence
(or not), which means its value is fixed. We assume that all
weights are not functions of the number of variables (and so are
$O(1)$). Finally, we consider two types of programs. The first we
call voting programs, which are a generalization of Example 2.5 in
which any subset of variables may be evidence and the weights may
be distinct.

Proposition A.2. Let $\epsilon > 0$. Consider Gibbs sampling
running on a particular factor graph with $n$ variables; there
exists a $\tau(n)$ that satisfies the upper bounds in Figure 12 and
such that $\|P_k - \pi\|_{tv} \le \epsilon$ after
$\tau(n) \log(1 + \epsilon^{-1})$ steps. Furthermore, for each of
the classes, we can construct a graph on which the minimum time to
achieve total variation distance $\epsilon$ is at least the lower
bound in Figure 12.

For example, for the voting example with logical semantics,
$\tau(n) = O(n \log n)$. The lower bounds are demonstrated with
explicit examples and analysis. The upper bounds use the special
form of Gibbs sampling on these factor graphs, and then use a
standard argument based on coupling and an analysis very similar to
a generalized coupon collector process. This argument is sufficient
to analyze the voting program in all three semantics.
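
To make the coupon collector connection concrete, here is a minimal simulation sketch. It assumes a random-scan sampler that picks one of the $n$ variables uniformly at random at every step (an assumption for this illustration) and compares the simulated time until every variable has been sampled at least once with the exact expectation $n H_n \approx n \log n$. The function names are hypothetical.

```python
import random


def steps_to_cover(n, rng):
    """Number of uniform random draws until each of n variables was drawn."""
    seen = set()
    steps = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        steps += 1
    return steps


def mean_cover_time(n, trials=2000, seed=0):
    """Average cover time over independent trials with a fixed seed."""
    rng = random.Random(seed)
    return sum(steps_to_cover(n, rng) for _ in range(trials)) / trials


def harmonic_prediction(n):
    """Exact expected cover time n * H_n of the coupon collector problem."""
    return n * sum(1.0 / i for i in range(1, n + 1))
```

For moderate $n$, the simulated mean should land close to `harmonic_prediction(n)`, which grows as $n \log n$ and matches the order of the bounds for the Logical and Ratio semantics.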

We consider a generalization of the voting program in which each
tuple is a random variable (possibly with a weight); specifically,
the program consists of a variable $Q$, a set of “Up” variables
$U$, and a set of “Down” variables $D$. Experimentally, we compute
how the different semantics converge on the voting program, as
illustrated in Figure 13. In this figure, we vary the size of
$|U| + |D|$ with $|U| = |D|$, set all variables to be non-evidence
variables, and measure the time that Gibbs sampling takes to
converge to within 1% of the correct marginal probability of $Q$.
We see that empirically, these semantics do have an effect on Gibbs
sampling performance (Linear converges much slower than either
Ratio or Logical), and we seek to give some insight into this
phenomenon for KBC programs.

Problem   OR Semantic   Upper Bound   Lower Bound
Voting    Logical       O(n log n)    Ω(n log n)
          Ratio         O(n log n)    Ω(n log n)
          Linear        2^{O(n)}      2^{Ω(n)}

Figure 12: Bounds on $\tau(n)$. The $O$ notation hides
constants that depend on the query and the weights.

Figure 13: Convergence of different semantics (horizontal axis:
$|U| + |D|$).

We analyze more general hierarchical programs in an asymptotic
setting.

Definition A.3. A rule
$q(\bar{x}) \text{ :- } p_1(\bar{x}_1), \ldots, p_k(\bar{x}_k)$ is
hierarchical if either $\bar{x} = \emptyset$ or there is a single
variable $x \in \bar{x}$ such that $x \in \bar{x}_i$ for
$i = 1, \ldots, k$. A set of rules (or a Datalog program) is
hierarchical if each rule is hierarchical and they can be
stratified.
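
The per-rule condition of Definition A.3 can be checked mechanically. The sketch below is an illustration only: it represents a rule by the variable names in its head and in each body atom (a representation assumed here, not DeepDive's), and it does not check the stratification requirement that applies to a whole set of rules.

```python
def is_hierarchical_rule(head_vars, body_atoms):
    """Check the per-rule condition of Definition A.3.

    head_vars: iterable of variable names in the head q(x).
    body_atoms: list of iterables, the variable names of each body atom p_i(x_i).
    Returns True if the head has no variables, or if some head variable
    occurs in every body atom.
    """
    head = set(head_vars)
    if not head:
        return True  # Boolean query: hierarchical by definition
    atoms = [set(atom) for atom in body_atoms]
    return any(all(x in atom for atom in atoms) for x in head)
```

For example, a rule `q(x) :- p1(x, y), p2(x, z)` passes the check because `x` appears in both body atoms, while `q(x, y) :- p1(x), p2(y)` does not.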

Evaluation of hierarchical programs is #P-hard, e.g., any Boolean
query is hierarchical, but there are #P-hard Boolean queries [42].
Using a similar argument, we show that any set of hierarchical
rules that have non-overlapping bodies converges in time
$O(N \log N \log(1 + \epsilon^{-1}))$ for either Logical or Ratio
semantics, where $N$ is the number of factors. This statement is
not trivial, as we show that Gibbs sampling on the simplest
non-hierarchical programs may take exponential time to converge.
Our definition is more general than the typical notions of safety
[13]: we consider multiple rules with rich correlations rather than
a single query over a tuple independent database. However, we only
provide guarantees about sampling (not exact evaluation) and have
no dichotomy theorem.

We also analyze an asymptotic setting, in which the domain size
grows, and we show that hierarchical programs also converge in
polynomial time in the Logical semantics. This result is a
theoretical example that suggests that the more data we have, the
better Gibbs sampling will behave. The key technical idea is that
for the hierarchical programs no variable has unbounded influence
on the final result, i.e., no variable contributes an amount
depending on $N$ to the final joint probability.

A.2 Proofs of Convergence Rates 

Here, we prove the convergence rates stated in Proposition A.2.
The strategy for the upper bound proofs involves constructing a
coupling between the Gibbs sampler and another process that attains
the equilibrium distribution at each step. First, we define a
coupling.

Definition A.4 (Coupling). A coupling of two random variables $X$
and $X'$ defined on some separate probability spaces $P$ and $P'$
is any new probability space $\hat{P}$ over which there are two
random variables $\hat{X}$ and $\hat{X}'$ such that $X$ has the
same distribution as $\hat{X}$ and $X'$ has the same distribution
as $\hat{X}'$.


Given a coupling of two sequences of random variables $X_k$ and
$X'_k$, the coupling time is defined as the first time $T$ when
$X_k = X'_k$. The following theorem lets us bound the total
variation distance in terms of the coupling time.

Theorem A.5. For any coupling $(X_k, X'_k)$ with coupling time $T$,

$$\|P(X_k \in \cdot) - P'(X'_k \in \cdot)\|_{tv} \le 2 P(T > k).$$

All of the coupling examples in this section use a correlated flip
coupler, which consists of two Gibbs samplers $X$ and $X'$, each of
which is running with the same random inputs. Specifically, both
samplers choose to sample the same variable at each timestep. Then,
if we define $p$ as the probability of sampling 1 for $X$, and $p'$
similarly for $X'$, it assigns both variables to 1 with probability
$\min(p, p')$, both variables to 0 with probability
$\min(1 - p, 1 - p')$, and assigns different values with
probability $|p - p'|$. If we initialize $X$ with an arbitrary
distribution and $X'$ with the stationary distribution $\pi$, it is
trivial to check that $X_k$ has distribution $P_k$, and $X'_k$
always has distribution $\pi$. Applying Theorem A.5 results in

$$\|P_k - \pi\|_{tv} \le 2 P(T > k).$$
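
One common way to realize such a correlated flip is to threshold a single shared uniform random number against both probabilities. The sketch below illustrates that construction (it is not taken from the DeepDive code base, and the function names are hypothetical) and estimates the joint distribution empirically, so the three probabilities stated above can be checked.

```python
import random


def correlated_flip(p, p_prime, rng):
    """Sample (x, x') using one shared uniform draw.

    Marginally P(x = 1) = p and P(x' = 1) = p_prime, and the two values
    differ exactly when the draw falls between p and p_prime.
    """
    r = rng.random()
    return int(r < p), int(r < p_prime)


def estimate_joint(p, p_prime, trials=200_000, seed=0):
    """Empirical joint distribution of (x, x') under the correlated flip."""
    rng = random.Random(seed)
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for _ in range(trials):
        counts[correlated_flip(p, p_prime, rng)] += 1
    return {pair: c / trials for pair, c in counts.items()}
```

With $p = 0.3$ and $p' = 0.7$, both values are 1 with probability about $\min(p, p') = 0.3$, both are 0 with probability about $\min(1-p, 1-p') = 0.3$, and they differ with probability about $|p - p'| = 0.4$.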

Now, we prove the bounds in Figure 12.

Theorem A.6 (UB for Voting: Logical and Ratio). For the voting
example, if all the weights on the variables are bounded
independently of the number of variables $n$, and $|U| = \Omega(n)$,
and $|D| = \Omega(n)$, then for either the logical or ratio
projection semantics, there exists a $\tau(n) = O(n \log n)$ such
that for any $\epsilon > 0$, $\|P_k - \pi\|_{tv} \le \epsilon$ for
any $k \ge \tau(n) \log(\epsilon^{-1})$.

Proof. For the voting problem, if we let $f$ denote the projection
semantic we are using, then for any possible world $I$ the weight
can be written as

$$W = \sum_{u \in U \cap I} w_u + \sum_{d \in D \cap I} w_d + w \sigma f(|U \cap I|) - w \sigma f(|D \cap I|),$$

where $\sigma = \mathrm{sign}(Q, I)$. Let $I$ denote the world at
the current timestep, $I'$ be the world at the next timestep, and
$I_-$ be $I$ with the variable we choose to sample removed. Then if
we sample a variable $u \in U$, and let $m_u = |U \cap I_-|$, then

$$P(u \in I') = \frac{\exp(w_u + w \sigma f(m_u + 1))}{\exp(w_u + w \sigma f(m_u + 1)) + \exp(w \sigma f(m_u))} = \left(1 + \exp\left(-w_u - w \sigma (f(m_u + 1) - f(m_u))\right)\right)^{-1}.$$


Since $f(x + 1) - f(x) \le 1$ for any $x$, it follows that
$\sigma (f(m_u + 1) - f(m_u)) \ge -1$. Furthermore, since $w_u$ is
bounded independently of $n$, we know that $w_u \ge w_{min}$ for
some constant $w_{min}$. Substituting this results in

$$P(u \in I') \ge \frac{1}{1 + \exp(-w_{min} + w)} = p_{min},$$

for some constant $p_{min}$ independent of $n$. The same argument
will show that, if we sample $d \in D$,

$$P(d \in I') \ge p_{min}.$$

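Now, if $|U \cap I| > \frac{p_{min} |U|}{2}$, it follows that, for
both the logical and ratio semantics,
$f(m_u + 1) - f(m_u) \le \frac{2}{p_{min} |U|}$. Under these
conditions, then if we sample $u$, we will have

$$\frac{1}{1 + \exp\left(-w_u + \frac{2w}{p_{min} |U|}\right)} \le P(u \in I') \le \frac{1}{1 + \exp\left(-w_u - \frac{2w}{p_{min} |U|}\right)}.$$

The same argument applies to sampling $d$.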

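Next, consider the situation if we sample $Q$.

$$P(Q \in I') = \left(1 + \exp\left(-2w f(|U \cap I|) + 2w f(|D \cap I|)\right)\right)^{-1}.$$

Under the condition that $|U \cap I| > \frac{p_{min} |U|}{2}$, and
similarly for $D$, this is bounded by

$$\left(1 + \exp\left(-2w f\left(\tfrac{p_{min} |U|}{2}\right) + 2w f(|D|)\right)\right)^{-1} \le P(Q \in I') \le \left(1 + \exp\left(2w f\left(\tfrac{p_{min} |D|}{2}\right) - 2w f(|U|)\right)\right)^{-1}.$$

But, for both logical and ratio semantics, we have that
$f(cn) - f(n) = O(1)$, and so for some $q_+$ and $q_-$ independent
of $n$,

$$q_- \le P(Q \in I') \le q_+.$$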

Now, consider a correlated-flip coupler running for $4n \log n$
steps on the Gibbs sampler. Let $E_1$ be the event that, after the
first $2n \log n$ steps, each of the variables has been sampled at
least once. (This will have probability at least $\frac{1}{2}$ by
the coupon collector's problem.) Let $E_U$ be the event that, after
the first $2n \log n$ steps, $|U \cap I| \ge \frac{p_{min} |U|}{2}$
for the next $2n \log n$ steps for both samplers, and similarly for
$E_D$. After all the entries have been sampled, $|U \cap I|$ and
$|D \cap I|$ will each be bounded from below by a binomial random
variable with parameter $p_{min}$ at each timestep, so from
Hoeffding's inequality, the probability that this constraint will
be violated by a sampler at any particular step is less than
$\exp(-\frac{1}{2} p_{min}^2 |U|)$. Therefore,

$$P(E_U \mid E_1) \ge \left(1 - \exp\left(-\tfrac{1}{2} p_{min}^2 |U|\right)\right)^{4n \log n} = \Omega(1).$$

The same argument shows that $P(E_D \mid E_1) = \Omega(1)$. Let
$E_2$ be the event that all the variables are resampled at least
once between time $2n \log n$ and $4n \log n$. Again this event has
probability at least $\frac{1}{2}$. Finally, let $C$ be the event
that coupling occurs at time $4n \log n$. Given $E_1$, $E_2$,
$E_U$, and $E_D$, this probability is equal to the probability that
each variable coupled individually the last time it was sampled.
For $u$, this probability is

$$1 - \frac{1}{1 + \exp\left(-w_u - \frac{2w}{p_{min} |U|}\right)} + \frac{1}{1 + \exp\left(-w_u + \frac{2w}{p_{min} |U|}\right)},$$

and similarly for $d$. We know this probability is
$1 - O(\frac{1}{n})$ by Taylor's theorem. Meanwhile, for $Q$, the
probability of coupling is $1 - q_+ + q_-$, which is always
$\Omega(1)$. Therefore,

$$P(C \mid E_1, E_2, E_U, E_D) \ge \Omega(1) \left(1 - O\left(\tfrac{1}{n}\right)\right)^{O(n)} = \Omega(1),$$

and since

$$P(C) = P(C \mid E_1, E_2, E_U, E_D) \, P(E_U \mid E_1) \, P(E_D \mid E_1) \, P(E_2 \mid E_1) \, P(E_1),$$

we can conclude that $P(C) = \Omega(1)$.

Therefore, this process couples with at least some constant
probability $p_C$ after $4n \log n$ steps. Since we can run this
coupling argument independently an arbitrary number of times, it
follows that, after $4Ln \log n$ steps, the probability of coupling
will be at least $1 - (1 - p_C)^L$. Therefore,

$$\|P_{4Ln \log n} - \pi\|_{tv} \le P(T > 4Ln \log n) \le (1 - p_C)^L.$$

This will be less than $\epsilon$ when
$L \ge \frac{\log(\epsilon)}{\log(1 - p_C)}$, which occurs when
$k \ge 4n \log n \, \frac{\log(\epsilon)}{\log(1 - p_C)}$. Letting
$\tau(n) = \frac{4n \log n}{-\log(1 - p_C)}$ proves the theorem. □

Theorem A.7 (LB for Voting: Logical and Ratio). For the voting
example, using either the logical or ratio projection semantics,
the minimum time to achieve total variation distance $\epsilon$ is
$\Omega(n \log n)$.

Proof. At a minimum, in order to converge, we must sample all the
variables. From the coupon collector's problem, this requires
$\Omega(n \log n)$ time. □

Theorem A.8 (UB for Voting: Linear). For the voting example, if all
the weights on the variables are bounded independently of $n$, then
for linear projection semantics, there exists a
$\tau(n) = 2^{O(n)}$ such that for any $\epsilon > 0$,
$\|P_k - \pi\|_{tv} \le \epsilon$ for any
$k \ge \tau(n) \log(\epsilon^{-1})$.

Proof. If at some timestep we choose to sample variable $u$,

$$P(u \in I') = \frac{\exp(w_u + wq)}{\exp(w_u + wq) + \exp(0)} = \Omega(1),$$

where $q = \mathrm{sign}(Q, I)$. The same is true for sampling $d$.
If we choose to sample $Q$,

$$P(Q \in I') = \frac{\exp(w |U \cap I| - w |D \cap I|)}{\exp(w |U \cap I| - w |D \cap I|) + \exp(-w |U \cap I| + w |D \cap I|)} = \left(1 + \exp(-2w (|U \cap I| - |D \cap I|))\right)^{-1} = \exp(-O(n)).$$

Therefore, if we run a correlated-flip coupler for $2n \log n$
timesteps, the probability that coupling will have occurred is
greater than the probability that all variables have been sampled
(which is $\Omega(1)$) times the probability that all variables
were sampled as 1 in both samplers. Thus if $p_C$ is the
probability of coupling,

$$p_C \ge \Omega(1) \left(\exp(-O(n))\right)^{2} \left(\Omega(1)\right)^{O(n)} = \exp(-O(n)).$$

Therefore, if we run for some $\tau(n) = 2^{O(n)}$ timesteps,
coupling will have occurred at least once with high probability.
The statements in terms of total variation distance follow via the
same argument used in the previous upper bound proof. □

Theorem A.9 (LB for Voting: Linear). For the voting example using
linear projection semantics, the minimum time to achieve total
variation distance $\epsilon$ is $2^{\Omega(n)}$.

Proof. Consider a sampler with unary weights all zero that starts
in the state $I = \{Q\} \cup U$. We will show that it takes an
exponential amount of time until $Q \notin I$. From above, the
probability of flipping $Q$ will be

$$P(Q \notin I') = \left(1 + \exp(2w (|U \cap I| - |D \cap I|))\right)^{-1}.$$

Meanwhile, the probability to flip any $u$ while $Q \in I$ is

$$P(u \notin I') = \frac{\exp(0)}{\exp(w) + \exp(0)} = (\exp(w) + 1)^{-1} = p,$$

for some $p$, and the probability of flipping $d$ is similarly

$$P(d \in I') = \frac{\exp(-w)}{\exp(-w) + \exp(0)} = p.$$

Now, consider the following events which could happen at any
timestep. While $Q \in I$, let $E_U$ be the event that
$|U \cap I| \le (1 - 2p) n$, and let $E_D$ be the event that
$|D \cap I| \ge 2pn$. Since $|U \cap I|$ and $|D \cap I|$ are both
bounded by binomial random variables with parameters $1 - p$ and
$p$ respectively, Hoeffding's inequality states that, at any
timestep,

$$P(|U \cap I| \le (1 - 2p) n) \le \exp(-2p^2 n),$$

and similarly

$$P(|D \cap I| \ge 2pn) \le \exp(-2p^2 n).$$

Now, while these bounds are satisfied, let $E_Q$ be the event that
$Q \notin I'$. This will be bounded by

$$P(E_Q) \le \left(1 + \exp(2w (1 - 4p) n)\right)^{-1} \le \exp(-2w (1 - 4p) n).$$

It follows that at any timestep,
$P(E_U \vee E_D \vee E_Q) = \exp(-\Omega(n))$. So, at any timestep
$k$, $P(Q \notin I) = k \exp(-\Omega(n))$. However, by symmetry the
stationary distribution $\pi$ must have
$\pi(Q \in I) = \frac{1}{2}$. Therefore, the total variation
distance is bounded by

$$\|P_k - \pi\|_{tv} \ge \frac{1}{2} - k \exp(-\Omega(n)).$$

So, for convergence to less than $\epsilon$, we must require at
least $2^{\Omega(n)}$ steps. □
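
The gap between the semantics in these proofs can be seen numerically from the conditional probability of $Q$ alone. The sketch below is an illustration under stated assumptions: it takes $f(x) = x$ for Linear, $f(x) = \log(1 + x)$ for Ratio, and $f(x) = 1$ if $x > 0$ (else 0) for Logical, forms that are not restated in this appendix, and it sets all unary weights to zero. The function names are hypothetical.

```python
import math

# Assumed forms of f for the three semantics (not restated in this appendix).
SEMANTICS = {
    "linear": lambda x: float(x),
    "ratio": lambda x: math.log(1.0 + x),
    "logical": lambda x: 1.0 if x > 0 else 0.0,
}


def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def prob_q_true(n_up, n_down, w, f):
    """P(Q in I') = (1 + exp(-2 w f(|U cap I|) + 2 w f(|D cap I|)))^{-1}."""
    return sigmoid(2.0 * w * (f(n_up) - f(n_down)))


def prob_q_false(n_up, n_down, w, f):
    """P(Q not in I'): the chance that Q leaves a world in which it is true."""
    return sigmoid(-2.0 * w * (f(n_up) - f(n_down)))
```

With 90 "Up" and 10 "Down" variables set to true and $w = 1$, the probability that $Q$ flips to false is about $e^{-160}$ under the linear form, roughly 0.014 under the ratio form, and exactly 0.5 under the logical form, which mirrors the exponential versus $n \log n$ behavior in Figure 12.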

B. ADDITIONAL SYSTEM DETAILS 
B.1 Decomposition with Inactive Variables

Our system also uses the following heuristic. The intuition behind
this heuristic is that the fewer variables we materialize, the
faster all approaches are. DeepDive allows the user to specify a
set of rules (the “interest area”) that identify a set of relations
that she will focus on in the next iteration of the development.
Given the set of rules, we use the standard dependency graph to
find relations that could be changed by the rules. We call
variables in the relations that the developer wants to improve
active variables, and we call inactive variables those that are not
needed in the next KBC iteration. Our previous discussion assumes
that all the variables are active. The intuition is that we only
require the active variables, but we first need to marginalize out
the inactive variables to create a new factor graph. If done
naively, this new factor graph can be excessively large. The
observation that serves as the basis of our optimization is as
follows: conditioning on all active variables, we can partition all
inactive variables into sets that are conditionally independent
from each other. Then, by appropriately grouping together some of
the sets, we can create a more succinct factor graph to
materialize.

Algorithm 2 Heuristic for Decomposition with Inactive Variables

Input: Factor graph $FG = (V, F)$, active variables $V^{(a)}$, and
inactive variables $V^{(i)}$.
Output: Set of groups of variables
$\mathcal{G} = \{(V_j^{(i)}, V_j^{(a)})\}$ that will be materialized
independently of other groups.

1: $\{V_j^{(i)}\} \leftarrow$ connected components of the factor
   graph after removing all active variables, i.e., $V - V^{(a)}$.
2: $V_j^{(a)} \leftarrow$ minimal set of active variables
   conditioned on which $V_j^{(i)}$ is independent of other
   inactive variables.
3: $\mathcal{G} \leftarrow \{(V_j^{(i)}, V_j^{(a)})\}$
4: for all $j \ne k$ s.t. $|V_j^{(a)} \cup V_k^{(a)}| = \max\{|V_j^{(a)}|, |V_k^{(a)}|\}$ do
5:   $\mathcal{G} \leftarrow \mathcal{G} \cup \{(V_j^{(i)} \cup V_k^{(i)}, V_j^{(a)} \cup V_k^{(a)})\}$
6:   $\mathcal{G} \leftarrow \mathcal{G} - \{(V_j^{(i)}, V_j^{(a)}), (V_k^{(i)}, V_k^{(a)})\}$
7: end for

Procedure. We now describe in detail our procedure to decompose a
graph in the presence of inactive variables, as illustrated in
Algorithm 2. Let $\{V_j^{(i)}\}$ be the collection of all the sets
of inactive variables that are conditionally independent from all
other inactive variables, given the active variables. This is
effectively a partitioning of all the inactive variables (Line 1).
For each set $V_j^{(i)}$, let $V_j^{(a)}$ be the minimal set of
active variables conditioning on which the variables in
$V_j^{(i)}$ are independent of all other inactive variables
(Line 2). Consider now the set $\mathcal{G}$ of pairs
$(V_j^{(i)}, V_j^{(a)})$. It is easy to see that we can materialize
each pair separately from other pairs. Following this approach to
the letter, some active random variables may be materialized
multiple times, once for each pair they belong to, with an impact
on performance. This can be avoided by grouping some pairs
together, with the goal of minimizing the runtime of inference on
the materialized graph. Finding the optimal grouping is actually
NP-hard, even if we make the simplification that each group
$g = (V^{(i)}, V^{(a)})$ has a cost function $c(g)$ that specifies
the inference time, and the total inference time is the sum of the
cost of all groups. The key observation concerning the NP-hardness
is that two arbitrary groups $g_1 = (V_1^{(i)}, V_1^{(a)})$ and
$g_2 = (V_2^{(i)}, V_2^{(a)})$ can be materialized together to form
a new group
$g_3 = (V_1^{(i)} \cup V_2^{(i)}, V_1^{(a)} \cup V_2^{(a)})$, and
$c(g_3)$ could be smaller than $c(g_1) + c(g_2)$ when $g_1$ and
$g_2$ share most of the active variables. This allows us to reduce
the problem to WeightedSetCover. Therefore, we use a greedy
heuristic that starts with groups, each containing one inactive
variable, and iteratively merges two groups
$(V_j^{(i)}, V_j^{(a)})$ and $(V_k^{(i)}, V_k^{(a)})$ if
$|V_j^{(a)} \cup V_k^{(a)}| = \max\{|V_j^{(a)}|, |V_k^{(a)}|\}$
(Lines 4-6). The intuition behind this heuristic is that, according
to our study of the tradeoff between different materialization
approaches, the fewer variables we materialize, the greater the
speedup for all materialization approaches will be.
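
The following is a minimal sketch of how the steps of this decomposition heuristic could look in code; it is an illustration under stated assumptions, not DeepDive's implementation. It assumes the factor graph is given as a hypothetical `neighbors` map from each variable to the variables it shares a factor with, takes the active neighbors of each inactive component as that component's conditioning set, and repeats the merge test until no pair of groups qualifies.

```python
def decompose(neighbors, active):
    """Group inactive variables following the merge rule of Lines 4-6.

    neighbors: dict mapping each variable to the set of variables that
               share a factor with it (assumed symmetric).
    active:    set of active variables.
    Returns a list of (inactive_set, active_set) groups.
    """
    inactive = set(neighbors) - set(active)
    seen, groups = set(), []

    # Lines 1-2: connected components of inactive variables, each paired
    # with the active variables on its boundary.
    for start in sorted(inactive):
        if start in seen:
            continue
        component, boundary, stack = set(), set(), [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            component.add(v)
            for u in neighbors[v]:
                if u in active:
                    boundary.add(u)
                elif u not in seen:
                    seen.add(u)
                    stack.append(u)
        groups.append((component, boundary))

    # Lines 4-6: merge two groups when the union of their active sets is
    # no larger than the bigger one, i.e., one active set contains the other.
    merged = True
    while merged:
        merged = False
        for j in range(len(groups)):
            for k in range(j + 1, len(groups)):
                a_j, a_k = groups[j][1], groups[k][1]
                if len(a_j | a_k) == max(len(a_j), len(a_k)):
                    new_group = (groups[j][0] | groups[k][0], a_j | a_k)
                    groups = [g for idx, g in enumerate(groups) if idx not in (j, k)]
                    groups.append(new_group)
                    merged = True
                    break
            if merged:
                break
    return groups
```

In a toy graph where inactive variable `i1` touches active `a1`, `i2` touches `a1` and `a2`, and `i3` touches only `a3`, the first two components are merged (their active sets are nested), while the third stays separate.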

Experimental Validation. We also verified that using the
decomposition technique has a positive impact on the performance of
the system. We built NoDecomposition, which does not decompose the
factor graph and selects either the sampling approach or the
variational approach for the whole factor graph. We report the best
of these two strategies as the performance of NoDecomposition in
Figure 14. We found that, for A1, removing structural decomposition
does not slow the system down compared with DeepDive, and it is
actually 2% faster because it involves less computation for
determining sample acceptance. However, for both feature extraction
and supervision rules, NoDecomposition is at least 6x slower. For
both types of rules, the cost of NoDecomposition corresponds to the
cost of the variational approach. In contrast, the sampling
approach has a near-zero acceptance rate because of a large change
in the distribution.

Figure 14: The Lesion Study of Decomposition.

            Adv.   News   Gen.    Pharma.   Paleo.
# Samples   2083   3007   22162   1852      2375

Figure 15: Number of Samples We can Materialize in 8 Hours.

B.2 Materialization Time 

We show how materialization time changes across different systems.
In the following experiment, we set the total materialization time
to 8 hours, a common time budget that our users often used for
overnight runs. We execute the whole materialization phase of
DeepDive for all systems, which will materialize both the sampling
approach and the variational approach, and we require the whole
process to finish in 8 hours. Figure 15 shows how many samples we
can collect given this 8-hour budget: for all five systems, in 8
hours, we can store more than 2000 samples, from which to generate
1000 samples during inference if the acceptance rate for the
sampling approach is 0.5.

B.3 Incremental Learning 

Rerunning learning happens when one labels new documents (typically
with an analyst in the loop), so one would like this to be an
efficient procedure. One interesting observation is that a popular
technique called distant supervision (used in DeepDive) is able to
label new documents heuristically using rules, so that new labeled
data is created without human intervention. Thus, we study how to
incrementally perform learning. This is well studied in the online
learning community, and here we adapt essentially standard
techniques. We essentially get this for free because we choose to
adapt standard online methods, such as stochastic gradient descent
with warmstart, for our training system. Here, warmstart means that
DeepDive uses the learned model in the last run as the starting
point instead of initializing the model to start randomly.
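
As a rough illustration of warmstart, the sketch below trains a tiny logistic regression model with plain stochastic gradient descent and lets a later run start from the weights of an earlier one. Everything here is an assumption made for the example: the toy data generator, the function names, the step size, and the use of a zero vector (rather than a random one) as the cold start. It is not DeepDive's training code.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def log_loss(w, data):
    """Average logistic loss of weights w on (features, label) pairs."""
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard the logarithm
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def sgd(data, dim, init=None, step=0.1, epochs=1, seed=0):
    """Plain SGD for logistic regression; passing `init` enables warmstart."""
    w = list(init) if init is not None else [0.0] * dim
    rng = random.Random(seed)
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j in range(dim):
                w[j] -= step * err * x[j]
    return w


def make_data(n, seed):
    """Hypothetical toy data: the label is 1 when the single feature is positive."""
    rng = random.Random(seed)
    values = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [((1.0, v), 1 if v > 0 else 0) for v in values]
```

A typical use would train on the old examples, then call `sgd(old + new, 2, init=w_previous)` once new labels arrive; the warm run begins from the loss of the previous model rather than from the loss of an untrained one.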
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Figure 16: Convergence of Different Incremental 
Learning Strategies.

Experimental Validation. We validate that, by adapting standard
online learning methods, DeepDive is able to outperform baseline
approaches. We compare DeepDive with two baseline approaches,
namely (1) stochastic gradient descent without warmstart and (2)
gradient descent with warmstart. We run on the dataset News with
rules F2 and S2, which introduce both new features and new training
examples with labels. We obtain a proxy for the optimal loss by
running both stochastic gradient descent and gradient descent
separately for 24 hours and picking the lowest loss. For each
learning approach, we grid search its step size in {1.0, 0.1, 0.01,
0.001, 0.0001}, run for 1 hour, and pick the fastest one to reach a
loss that is within 10% of the optimal loss. We estimate the loss
after each learning epoch and report the percentage within the
optimal loss in Figure 16.

As shown in Figure 16, the loss of all learning approaches
decreases with more learning epochs. However, DeepDive's
SGD+Warmstart approach reaches a loss within 10% of the optimal
loss faster than the other approaches. Compared with SGD-Warmstart
(stochastic gradient descent without warmstart), SGD+Warmstart is
2x faster in reaching a loss that is within 10% of the optimal loss
because, thanks to warmstart, it starts with a model that has lower
loss. Compared with gradient descent, SGD+Warmstart is about 10x
faster. Although SGD+Warmstart and gradient descent with warmstart
have the same initial loss, SGD converges faster than gradient
descent in our example.

B.4 Discussion of Concept Drift

Concept drift [58,63] refers to the scenario that arises in online
learning when the training examples become available as a stream,
and the distribution from which these examples are generated keeps
changing over time. Therefore, the key to resolving the problem of
concept drift lies primarily in how to design a machine learning
system that automatically detects and adapts to the changes. The
design needs to consider how to not only update the model
incrementally, but also “forget the old information” [58]. Although
we have not observed quality issues caused by concept drift in our
KBC applications, it is possible for DeepDive to encounter this
issue in other applications. Thus, we study the impact of concept
drift on our incremental inference and learning approach.

Impact on Performance. In this work, we solely focus on the
efficiency of updating the model incrementally. When concept drift
happens, the new distribution and new target model would be
significantly different from the materialized distribution as well
as the learned model from the last iteration. From our analysis of
the tradeoff between materialization approaches and from Section
B.3, we hypothesize that there are two impacts of concept drift
that require consideration in DeepDive. First, the difference
between the materialized distribution and the target distribution
is modeled as the amount of change in our tradeoff analysis, and
therefore, in the case of concept drift, we expect our system to
favor the variational approach over the sampling approach. Second,
because the difference between the learned model and the target
model is large, it is not clear whether the warmstart scheme we
discussed in Section B.3 will work.

Figure 17: Impact of Concept Drifts.



Experimental Validation. As our workloads do not have concept
drift, we follow prior art [63]: we use its dataset, which contains
9,324 emails ordered chronologically, and the same task, namely
predicting for each email whether it is spam or not. We implement a
logistic regression classifier in DeepDive with a rule similar to
Example 2.6. We use the first 10% of the emails and the first 30%
of the emails as training sets, respectively, and use the remaining
70% as the testing set. We construct two systems in a similar way
as in our earlier experiments: Rerun, which trains directly on 30%
of the emails, and Incremental, which materializes using 10% of the
emails and incrementally trains on 30% of the emails. We run both
systems and measure the test set loss after each learning epoch.

Figure 17 shows the result. We see that even with concept drift,
both systems converge to the same loss. Incremental converges
faster than Rerun because, even with concept drift, warmstart still
allows Incremental to start from a lower loss at the first
iteration. In terms of the time used for each iteration,
Incremental and Rerun use roughly the same amount of time. This is
expected because Incremental rejects almost all samples due to the
large difference between distributions and switches to the
variational approach, which, for a logistic regression model where
all variables are active and all weights have been changed, is very
similar to the original model. From these results we see that, even
with concept drift, Incremental still provides a benefit over
rerunning the system from scratch due to warmstart; however, as
expected, the benefit from incremental inference is smaller.

C. ADDITIONAL RELATED WORK 

We provide additional related work, especially work that has
influenced the current design of DeepDive's language component.

More Related KBC Systems. DeepDive's model of KBC is motivated by
recent attempts to use machine learning-based techniques for KBC
[51,52,62,72,76,81,84] and the line of research that aims to
improve the quality of a specific component of a KBC system
[48,53,56,57,60,65-68,70,73,74,77,78,82,83]. When designing
DeepDive, we used these systems as test cases to justify the
generality of our framework. In fact, we find that DeepDive is able
to model 13 of these popular KBC systems
[32,48,57,61,65-68,70,73,74,82,83].


Other Incremental Algorithms. We build on decades of study of
incremental view maintenance [55,59]: we rely on the classic
incremental maintenance techniques to handle relational operations.
Recently, others have proposed different approaches for incremental
maintenance for individual analytics workflows like iterative
linear algebra problems [71], classification as new training data
arrives [28], or segmentation as new data arrives [10].

Factor graphs have been used by some probabilistic databases as the
underlying representation [47,75], but they did not study how to
reuse computation across different, but similar, factor graphs.